D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models






Published 5/8/2024 by Duygu Altinok



Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs in miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.



Overview

  • Large language models (LLMs) have become increasingly popular and widely used due to their impressive performance on various tasks.
  • However, LLMs face challenges such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning.
  • Evaluating LLMs' capabilities in miscellaneous reasoning tasks is an active area of research.
  • Prior to LLMs, Transformers had already proven successful in the medical domain for natural language understanding (NLU) tasks.
  • LLMs have also been trained and utilized in the medical domain, raising concerns about factual accuracy, safety, and inherent limitations.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. They have become very popular because they can perform remarkably well on a wide range of tasks, such as summarizing articles, answering questions, and even generating creative content.

However, LLMs are not perfect. They can sometimes make up information that isn't true (hallucinations), provide inconsistent facts, and struggle with tasks that require numerical or quantitative reasoning. Researchers are actively studying how to better evaluate and improve these capabilities.

Before the rise of LLMs, smaller models built on the Transformer architecture (the same architecture that underlies today's LLMs) had already shown promise in the medical field, helping with tasks like understanding medical documents. Now, LLMs are also being used in healthcare, which raises concerns about whether they can be accurate and safe enough for sensitive medical applications.

This paper focuses on evaluating how well popular LLMs can perform a specific task: deciding whether a statement is supported by the contents of a clinical trial report. The researchers tested several LLMs and found that the best-performing model, Gemini, achieved an F1-score of 0.748 on the test set, a good but not perfect result that placed ninth on the task scoreboard. According to the authors, this study is the first of its kind to closely examine LLMs' inference abilities in the medical domain.

Technical Explanation

The researchers in this paper evaluated the natural language inference capabilities of various open-source and closed-source large language models (LLMs) using clinical trial reports as the dataset. Natural language inference is the task of determining whether a given statement can be inferred from a given premise.
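As a rough illustration of the premise/statement setup described above, the following Python sketch formats one instance as a zero-shot prompt and maps a free-text model reply to a label. The class and function names, the prompt wording, the two-way label set, and the example premise and statement are all assumptions made for illustration; none of them are taken from the paper or from the task dataset.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed two-way label set for this sketch.
LABELS = ("Entailment", "Contradiction")


@dataclass
class NLIExample:
    premise: str    # excerpt from a clinical trial report
    statement: str  # claim to be checked against the premise


def build_prompt(example: NLIExample) -> str:
    """Format one premise/statement pair as a zero-shot instruction."""
    return (
        "Premise (clinical trial report excerpt):\n"
        f"{example.premise}\n\n"
        f"Statement: {example.statement}\n\n"
        "Does the premise support the statement? "
        f"Answer with one word: {LABELS[0]} or {LABELS[1]}."
    )


def parse_label(reply: str) -> Optional[str]:
    """Map a free-text reply to a label, or None if it is ambiguous."""
    text = reply.strip().lower()
    hits = [label for label in LABELS if label.lower() in text]
    return hits[0] if len(hits) == 1 else None


# Invented example that needs a small amount of numerical reasoning.
example = NLIExample(
    premise="Adverse events: nausea 12/50 (24.0%), fatigue 5/50 (10.0%).",
    statement="More than a fifth of participants experienced nausea.",
)
prompt = build_prompt(example)
```

The string returned by `build_prompt` would be sent to whichever model is being evaluated. `parse_label` returns `None` when the reply names both labels or neither, so unclear answers can be counted separately rather than silently guessed.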

They measured the performance of each LLM on the test set and further analyzed performance on the development set, focusing on challenging instances that involve medical abbreviations and numerical-quantitative reasoning. Their best-performing LLM, Gemini, achieved an F1-score of 0.748 on the test set, ranking ninth on the task scoreboard.
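For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a binary labelling, treating "Entailment" as the positive class; that choice, the function name, and the toy gold and predicted labels are assumptions for illustration, and the task's official scorer may average differently.

```python
def f1_score(gold, predicted, positive="Entailment"):
    """Binary F1 for the given positive class (assumed for this sketch)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only.
gold = ["Entailment", "Entailment", "Contradiction", "Contradiction", "Entailment"]
pred = ["Entailment", "Contradiction", "Contradiction", "Entailment", "Entailment"]
score = f1_score(gold, pred)  # tp=2, fp=1, fn=1, so precision = recall = 2/3
```

In this toy case precision and recall are both 2/3, so the F1-score is also 2/3.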

The authors describe this study as the first of its kind to provide a comprehensive examination of LLMs' inference capabilities within the medical domain. Prior to the breakthrough of LLMs, Transformers had already demonstrated success in the medical field for various natural language understanding (NLU) tasks, as shown in related research. However, the increasing use of LLMs in healthcare has raised concerns about their factual accuracy, adherence to safety protocols, and inherent limitations, which this paper aims to address.

Critical Analysis

The paper provides a valuable contribution to the field by focusing on the critical issue of evaluating LLMs' performance in the medical domain, where factual accuracy and safety are of utmost importance. The researchers' approach of using clinical trial reports as the dataset is well-chosen, as these documents contain complex medical terminology and require a deep understanding of the subject matter.

However, beyond its analysis of challenging instances involving medical abbreviations and numerical-quantitative reasoning, the paper does not appear to delve into the specific limitations or potential biases of the LLMs tested. It would be helpful to understand the types of errors or inconsistencies these models exhibit, particularly in the context of medical information. Additionally, the paper could have addressed the broader implications of using LLMs in healthcare, such as the ethical considerations around relying on these models for critical decision-making processes.

The researchers also acknowledge that evaluating LLMs' numerical-quantitative reasoning capabilities remains a challenge, as evidenced by the performance issues observed in their study. Further research is needed to address this limitation and ensure that LLMs can handle the complex numerical aspects often present in medical data and decision-making.


Conclusion

This study represents an important step in understanding the capabilities and limitations of large language models (LLMs) in the medical domain. By evaluating the natural language inference abilities of popular LLMs using clinical trial reports, the researchers have provided valuable insights into the current state of these models' performance in a critical field.

The findings highlight the need for continued efforts to improve LLMs' factual accuracy, safety, and numerical-quantitative reasoning capabilities, particularly as they are increasingly adopted in healthcare applications. As the use of LLMs in sensitive domains expands, it is crucial to thoroughly assess their limitations and ensure that the information they provide is reliable and trustworthy.

This study serves as a foundation for future research in this area, paving the way for the development of more robust and domain-specific LLMs that can safely and effectively assist medical professionals and patients.


Related Papers

IITK at SemEval-2024 Task 2: Exploring the Capabilities of LLMs for Safe Biomedical Natural Language Inference for Clinical Trials


Shreyasi Mandal, Ashutosh Modi





Large language models (LLMs) have demonstrated state-of-the-art performance in various natural language processing (NLP) tasks across multiple domains, yet they are prone to shortcut learning and factual inconsistencies. This research investigates LLMs' robustness, consistency, and faithful reasoning when performing Natural Language Inference (NLI) on breast cancer Clinical Trial Reports (CTRs) in the context of SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We examine the reasoning capabilities of LLMs and their adeptness at logical problem-solving. A comparative analysis is conducted on pre-trained language models (PLMs), GPT-3.5, and Gemini Pro under zero-shot settings using a Retrieval-Augmented Generation (RAG) framework, integrating various reasoning chains. The evaluation yields an F1 score of 0.69, consistency of 0.71, and a faithfulness score of 0.90 on the test dataset.



A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations


Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang





Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through continued training of open-source general LLMs, which requires significantly fewer computational resources than training LLMs from scratch. Additionally, this approach offers better protection of patient privacy compared to API-based solutions. This survey systematically explores how to train medical LLMs based on general LLMs. It covers: (a) how to acquire a training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising future research directions. This survey can provide guidance for the development of LLMs focused on various medical applications, such as medical education, diagnostic planning, and clinical assistants.




Evaluating large language models in medical applications: a survey

Xiaolan Chen, Jiayang Xiang, Shanfu Lu, Yexin Liu, Mingguang He, Danli Shi





Large language models (LLMs) have emerged as powerful tools with transformative potential across numerous domains, including healthcare and medicine. In the medical domain, LLMs hold promise for tasks ranging from clinical decision support to patient education. However, evaluating the performance of LLMs in medical contexts presents unique challenges due to the complex and critical nature of medical information. This paper provides a comprehensive overview of the landscape of medical LLM evaluation, synthesizing insights from existing studies and highlighting evaluation data sources, task scenarios, and evaluation methods. Additionally, it identifies key challenges and opportunities in medical LLM evaluation, emphasizing the need for continued research and innovation to ensure the responsible integration of LLMs into clinical practice.



SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials


Mael Jullien, Marco Valentino, André Freitas





Large Language Models (LLMs) are at the forefront of NLP achievements but fall short in dealing with shortcut learning, factual inconsistency, and vulnerability to adversarial inputs. These shortcomings are especially critical in medical contexts, where they can misrepresent actual model capabilities. Addressing this, we present SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our contributions include the refined NLI4CT-P dataset (i.e., Natural Language Inference for Clinical Trials - Perturbed), designed to challenge LLMs with interventional and causal reasoning tasks, along with a comprehensive evaluation of methods and results for participant submissions. A total of 106 participants registered for the task, contributing to over 1200 individual submissions and 25 system overview papers. This initiative aims to advance the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making. We anticipate that the dataset, models, and outcomes of this task can support future research in the field of biomedical NLI. The dataset, competition leaderboard, and website are publicly available.

