emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information

2404.12050

Published 4/19/2024 by Jimenez Eladio, Hao Wu

Abstract

Machine Reading Comprehension (MRC) holds a pivotal role in shaping Medical Question Answering Systems (QAS) and transforming the landscape of accessing and applying medical information. However, the inherent challenges in the medical field, such as complex terminology and question ambiguity, necessitate innovative solutions. One key solution involves integrating specialized medical datasets and creating dedicated datasets. This strategic approach enhances the accuracy of QAS, contributing to advancements in clinical decision-making and medical research. To address the intricacies of medical terminology, a specialized dataset was integrated, exemplified by a novel span extraction dataset derived from emrQA but restructured into 163,695 questions and 4,136 manually obtained answers; this new dataset is called the emrQA-msquad dataset. Additionally, for ambiguous questions, a dedicated medical dataset for the span extraction task was introduced, reinforcing the system's robustness. The fine-tuning of models such as BERT, RoBERTa, and Tiny RoBERTa for medical contexts significantly improved response accuracy within the F1-score range of 0.75 to 1.00, from 10.1% to 37.4%, 18.7% to 44.7%, and 16.0% to 46.8%, respectively. Finally, the emrQA-msquad dataset is publicly available at https://huggingface.co/datasets/Eladio/emrqa-msquad.

Overview

This paper introduces emrQA-msquad, a new medical dataset that builds upon the SQuAD V2.0 framework and is enriched with medical information from the emrQA dataset.
The dataset is designed to support research in medical question answering, a critical task for improving healthcare and clinical decision-making.
The paper describes the dataset's creation, its key features, and its potential applications in the field of medical natural language processing.

Plain English Explanation

The researchers have created a new medical dataset called emrQA-msquad that combines the structure and format of the popular SQuAD V2.0 dataset with medical information from the emrQA dataset. The goal is to provide a resource that can be used to train and improve question-answering systems for medical and healthcare applications.

Question-answering systems are AI models that can understand natural language questions and provide accurate and relevant answers. These systems have the potential to greatly assist healthcare professionals and patients by quickly retrieving relevant medical information. However, building these systems requires large, high-quality datasets that cover a wide range of medical topics.

The emrQA-msquad dataset aims to fill this gap by providing a structured dataset of medical questions and answers. The questions cover a variety of medical topics, and the answers are drawn from reliable medical sources. By building on the well-established SQuAD V2.0 format, the researchers hope to make the dataset easy to use and integrate with existing question-answering models and techniques.

Overall, this new dataset represents an important contribution to the field of medical natural language processing and has the potential to significantly improve the capabilities of medical question-answering systems, benefiting both healthcare providers and patients.

Technical Explanation

The emrQA-msquad dataset is built upon the SQuAD V2.0 framework, a widely used benchmark for general-domain question answering. The researchers have extended this framework by incorporating medical information from the emrQA dataset, which covers a broad range of clinical topics.
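To make the SQuAD V2.0 structure concrete, here is a minimal sketch of a single record and two sanity checks on it. The field names follow the flat layout commonly used for SQuAD V2.0 on the Hugging Face Hub (`id`, `title`, `context`, `question`, `answers`); that emrQA-msquad uses exactly this layout is an assumption, and the example record is invented for illustration, not taken from the dataset.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    text: List[str]          # answer strings copied verbatim from the context
    answer_start: List[int]  # character offset of each answer in the context


class QARecord(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def is_answerable(record: QARecord) -> bool:
    """SQuAD V2.0 marks unanswerable questions with empty answer lists."""
    return len(record["answers"]["text"]) > 0


def spans_are_consistent(record: QARecord) -> bool:
    """Check that every answer text sits in the context at its stated offset."""
    context = record["context"]
    answers = record["answers"]
    if len(answers["text"]) != len(answers["answer_start"]):
        return False
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Invented example (assumption): a short clinical-style sentence.
CONTEXT = "The patient was started on metformin 500 mg twice daily for type 2 diabetes."

answerable: QARecord = {
    "id": "example-0001",
    "title": "hypothetical-note",
    "context": CONTEXT,
    "question": "What medication was the patient started on?",
    "answers": {
        "text": ["metformin 500 mg"],
        "answer_start": [CONTEXT.index("metformin")],
    },
}

unanswerable: QARecord = {
    "id": "example-0002",
    "title": "hypothetical-note",
    "context": CONTEXT,
    "question": "What was the patient's blood pressure?",
    "answers": {"text": [], "answer_start": []},
}

if __name__ == "__main__":
    for rec in (answerable, unanswerable):
        print(rec["id"], is_answerable(rec), spans_are_consistent(rec))
```

Because the task is span extraction, the offset check is the property that matters most: a model is trained to predict start and end positions, so an answer whose text does not match the context at `answer_start` would give it a wrong target.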

According to the paper's abstract, the dataset consists of 163,695 questions and 4,136 manually obtained answers, with questions covering a wide variety of medical subjects, such as symptoms, diseases, treatments, and procedures. Because this is a span extraction dataset, each answer is a span of text taken from the context passage that accompanies the question.

To create the dataset, the researchers first selected relevant passages from the emrQA corpus, then generated questions based on these passages using a combination of human annotation and automated techniques. The dataset was further enriched by translating a subset of the questions and answers into other languages, including Basque, to support multilingual research.

The researchers also explored techniques for synthetic dataset creation to augment the dataset and improve the diversity and robustness of the question-answer pairs.

The emrQA-msquad dataset is designed to support the development and evaluation of medical question-answering systems, with the ultimate goal of improving healthcare question answering in a reliable and time-aware manner.
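The abstract reports results as the percentage of responses whose F1 score falls between 0.75 and 1.00. As a rough illustration of how such a figure could be computed, the sketch below uses the token-overlap F1 commonly applied to SQuAD-style answers and then measures the share of scores in a range. The paper's exact evaluation script is not shown in this summary, so the normalization steps and the helper names (`normalize_answer`, `token_f1`, `share_in_range`) are assumptions.

```python
import re
import string
from collections import Counter
from typing import List, Sequence


def normalize_answer(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty answers agree (the unanswerable case); otherwise a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def share_in_range(scores: Sequence[float], low: float = 0.75, high: float = 1.0) -> float:
    """Fraction of scores that fall inside the closed interval [low, high]."""
    if not scores:
        return 0.0
    return sum(low <= s <= high for s in scores) / len(scores)


if __name__ == "__main__":
    # Invented prediction/reference pairs, for illustration only.
    pairs = [
        ("metformin 500 mg", "metformin 500 mg"),
        ("the metformin", "metformin 500 mg"),
        ("aspirin", "metformin 500 mg"),
        ("", ""),
    ]
    scores = [token_f1(pred, ref) for pred, ref in pairs]
    print(scores, share_in_range(scores))
```

Reporting the share of answers in a high-F1 band, rather than only the mean F1, shows how often a model's extracted span is nearly or fully correct, which is the reading of the abstract's figures assumed here.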

Critical Analysis

The emrQA-msquad dataset represents a valuable contribution to the field of medical natural language processing, but it is important to consider its potential limitations and areas for further research.

One potential limitation is the coverage of the dataset, which may not fully capture the breadth and complexity of medical knowledge. While the researchers have aimed to include a wide range of topics, there may still be gaps or biases in the types of questions and answers represented.

Additionally, the dataset's reliance on existing medical sources, such as clinical guidelines, may introduce potential biases or inaccuracies, as these sources may not always be up-to-date or comprehensive.

Further research could explore ways to expand the dataset's coverage, potentially by incorporating data from additional sources or leveraging techniques for synthetic dataset creation to generate more diverse and representative question-answer pairs.

Researchers may also need to consider the challenges of time-aware question answering, as medical knowledge and practices can evolve rapidly, and the dataset may not always reflect the most current information.

Despite these potential limitations, the emrQA-msquad dataset represents a significant step forward in the development of medical question-answering systems, and the researchers' efforts to leverage existing frameworks and datasets, as well as explore techniques for dataset augmentation, are commendable.

Conclusion

The emrQA-msquad dataset introduced in this paper is an important contribution to the field of medical natural language processing. By combining the structure and format of the popular SQuAD V2.0 dataset with medical information from the emrQA dataset, the researchers have created a valuable resource for training and evaluating question-answering systems in the medical domain.

By supporting the development of more accurate and reliable medical question-answering systems, the dataset could have a significant impact on healthcare, helping both providers and patients access relevant medical information quickly. As the field evolves, further refinement of the dataset, together with techniques for improving healthcare question answering in a reliable and time-aware manner, will be crucial for advancing the state of the art in medical natural language processing.

Related Papers

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang

While Large Language Models (LLMs) have demonstrated commendable performance across a myriad of domains and tasks, existing LLMs still exhibit a palpable deficit in handling multimodal functionalities, especially for the Spoken Question Answering (SQA) task which necessitates precise alignment and deep interaction between speech and text features. To address the SQA challenge on LLMs, we initially curated the free-form and open-ended LibriSQA dataset from Librispeech, comprising Part I with natural conversational formats and Part II encompassing multiple-choice questions followed by answers and analytical segments. Both parts collectively include 107k SQA pairs that cover various topics. Given the evident paucity of existing speech-text LLMs, we propose a lightweight, end-to-end framework to execute the SQA task on the LibriSQA, witnessing significant results. By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks. Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.

4/19/2024

cs.CL

UQA: Corpus for Urdu Question Answering

Samee Arif, Sualeha Farid, Awais Athar, Agha Ali Raza

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we have an F1 score of 85.99 and 74.56 EM. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.

5/3/2024

cs.CL cs.AI cs.IR cs.LG

TIGQA: An Expert Annotated Question Answering Dataset in Tigrinya

Hailay Teklehaymanot, Dren Fazlija, Niloy Ganguly, Gourab K. Patro, Wolfgang Nejdl

The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models. The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.

4/29/2024

cs.CL

MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

Xiaotang Gai, Chenyi Zhou, Jiaxiang Liu, Yang Feng, Jian Wu, Zuozhu Liu

Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and significant advancement in healthcare. It assists medical experts to swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.

4/19/2024

cs.CV