DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training

arXiv:2405.00321

Published 5/2/2024 by Bhuvanesh Verma, Lisa Raithel

Abstract

The NLI4CT task at SemEval-2024 emphasizes the development of robust models for Natural Language Inference on Clinical Trial Reports (CTRs) using large language models (LLMs). This edition introduces interventions specifically targeting the numerical, vocabulary, and semantic aspects of CTRs. Our proposed system harnesses the capabilities of the state-of-the-art Mistral model, complemented by an auxiliary model, to focus on the intricate input space of the NLI4CT dataset. Through the incorporation of numerical and acronym-based perturbations to the data, we train a robust system capable of handling both semantic-altering and numerical contradiction interventions. Our analysis on the dataset sheds light on the challenging sections of the CTRs for reasoning.

Overview

The NLI4CT task at SemEval-2024 focuses on developing robust models for Natural Language Inference on Clinical Trial Reports (CTRs) using large language models (LLMs).
This edition introduces interventions targeting the numerical, vocabulary, and semantic aspects of CTRs.
The proposed system uses the state-of-the-art Mistral model and an auxiliary model to handle the complex input space of the NLI4CT dataset.
The system is trained with numerical and acronym-based perturbations to handle both semantic-altering and numerical contradiction interventions.
The analysis of the dataset identifies which sections of the CTRs are the most challenging to reason over.

Plain English Explanation

The NLI4CT task at SemEval-2024 is focused on developing AI models that can understand and reason about the information in clinical trial reports. These reports contain a lot of technical and numerical information, which can be challenging for language models to handle.

The researchers propose a system that combines a powerful language model called Mistral with an additional model to help the system better understand the numerical, vocabulary, and semantic aspects of the clinical trial reports. They train the system to handle different types of changes or "interventions" to the text, such as changes to the numbers or the use of medical abbreviations.

By analyzing the model's performance on the dataset, the researchers can identify the specific sections of the clinical trial reports that are the most challenging for the AI to reason about. This helps them understand where more work is needed to improve the model's understanding of this type of technical content.

Technical Explanation

The NLI4CT task at SemEval-2024 focuses on developing robust models for Natural Language Inference on Clinical Trial Reports (CTRs) using large language models (LLMs). This edition introduces interventions specifically targeting the numerical, vocabulary, and semantic aspects of CTRs.

The proposed system harnesses the capabilities of the state-of-the-art Mistral model, complemented by an auxiliary model, to focus on the intricate input space of the NLI4CT dataset. Through the incorporation of numerical and acronym-based perturbations to the data, the researchers train a robust system capable of handling both semantic-altering and numerical contradiction interventions.
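As a rough illustration of the numerical and acronym-based perturbations described above, the following Python sketch generates perturbed variants of a single statement. It is not the authors' implementation: the `ACRONYMS` table, the function names, the scaling factor, and the assumption that scaling a number turns an entailed statement into a contradiction are all hypothetical choices made for this example.

```python
import re

# Hypothetical acronym table; a real system would need a curated medical lexicon.
ACRONYMS = {
    "adverse event": "AE",
    "overall survival": "OS",
    "progression-free survival": "PFS",
}

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def perturb_numbers(statement: str, factor: float = 2.0) -> str:
    """Scale every number in the statement by `factor` (assumed label-flipping)."""
    def scale(match: re.Match) -> str:
        text = match.group(0)
        value = float(text) * factor
        # Keep integers as integers so the perturbed text still reads naturally.
        return f"{value:.1f}" if "." in text else str(int(value))
    return _NUMBER.sub(scale, statement)


def abbreviate(statement: str) -> str:
    """Replace known long forms with their acronyms (assumed meaning-preserving)."""
    for long_form, short in ACRONYMS.items():
        statement = re.sub(re.escape(long_form), short, statement, flags=re.IGNORECASE)
    return statement


def augment(statement: str, label: str) -> list[tuple[str, str]]:
    """Return (statement, label) variants for one training example."""
    # The acronym variant keeps the original label.
    variants = [(abbreviate(statement), label)]
    # The numerical variant is only built from entailed statements that contain a number.
    if label == "Entailment" and _NUMBER.search(statement):
        variants.append((perturb_numbers(statement), "Contradiction"))
    return variants


if __name__ == "__main__":
    for text, label in augment("3 patients reported an adverse event", "Entailment"):
        print(label, "|", text)
```

Keeping the two perturbation types separate mirrors the distinction drawn in the text: the acronym variant is meant to preserve meaning, while the numerical variant is meant to alter it.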

The analysis of the dataset identifies which sections of the CTRs are hardest to reason over, pointing to the areas where language understanding and reasoning in technical domains need further work.
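One simple way to find the challenging CTR sections is to break prediction accuracy down by section. The sketch below assumes each example is a dictionary with hypothetical `section`, `gold`, and `pred` keys. It is an illustrative analysis, not the procedure used in the paper.

```python
from collections import defaultdict


def accuracy_by_section(examples):
    """Compute accuracy per CTR section.

    `examples` is an iterable of dicts with the (hypothetical) keys
    'section', 'gold', and 'pred'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["section"]] += 1
        correct[ex["section"]] += int(ex["gold"] == ex["pred"])
    return {section: correct[section] / total[section] for section in total}


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    toy = [
        {"section": "Results", "gold": "Entailment", "pred": "Entailment"},
        {"section": "Results", "gold": "Contradiction", "pred": "Entailment"},
        {"section": "Eligibility", "gold": "Entailment", "pred": "Entailment"},
    ]
    for section, acc in sorted(accuracy_by_section(toy).items(), key=lambda kv: kv[1]):
        print(f"{section}: {acc:.2f}")
```

Sorting by accuracy puts the weakest section first, which makes the hardest parts of the reports easy to spot.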

Critical Analysis

The paper addresses natural language inference on clinical trial reports by combining targeted data perturbations with a Mistral-based model and an auxiliary model. The resulting system is designed to withstand the numerical and semantic interventions introduced in this edition of the task.

However, the analysis of the model's performance on the dataset raises questions about the generalizability of the approach. While the system's ability to handle numerical and vocabulary-based interventions is encouraging, it is unclear how well it would perform on more nuanced or context-dependent inferences that may be crucial for real-world clinical decision-making.

Additionally, the paper does not provide a detailed discussion of the potential biases or limitations of the language models used, which could impact the model's reliability and fairness when applied to diverse clinical trial datasets. Further research is needed to explore these aspects and ensure the safe and ethical deployment of such systems in clinical settings.

Conclusion

The NLI4CT task at SemEval-2024 highlights the importance of developing robust natural language understanding systems for the clinical domain. The proposed approach, which combines a state-of-the-art language model with targeted interventions, demonstrates promising results in handling the numerical, vocabulary, and semantic challenges of clinical trial reports.

By shedding light on the specific areas of difficulty in this domain, the research paves the way for further advancements in language understanding and reasoning, ultimately contributing to the development of more reliable and trustworthy AI systems for clinical decision-making and patient care.

Related Papers

SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials

Mael Jullien, Marco Valentino, André Freitas

Large Language Models (LLMs) are at the forefront of NLP achievements but fall short in dealing with shortcut learning, factual inconsistency, and vulnerability to adversarial inputs. These shortcomings are especially critical in medical contexts, where they can misrepresent actual model capabilities. Addressing this, we present SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our contributions include the refined NLI4CT-P dataset (i.e., Natural Language Inference for Clinical Trials - Perturbed), designed to challenge LLMs with interventional and causal reasoning tasks, along with a comprehensive evaluation of methods and results for participant submissions. A total of 106 participants registered for the task contributing to over 1200 individual submissions and 25 system overview papers. This initiative aims to advance the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making. We anticipate that the dataset, models, and outcomes of this task can support future research in the field of biomedical NLI. The dataset, competition leaderboard, and website are publicly available.

4/9/2024

cs.CL cs.AI

DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness

Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De

Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.

4/16/2024

cs.CL

IITK at SemEval-2024 Task 2: Exploring the Capabilities of LLMs for Safe Biomedical Natural Language Inference for Clinical Trials

Shreyasi Mandal, Ashutosh Modi

Large language models (LLMs) have demonstrated state-of-the-art performance in various natural language processing (NLP) tasks across multiple domains, yet they are prone to shortcut learning and factual inconsistencies. This research investigates LLMs' robustness, consistency, and faithful reasoning when performing Natural Language Inference (NLI) on breast cancer Clinical Trial Reports (CTRs) in the context of SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We examine the reasoning capabilities of LLMs and their adeptness at logical problem-solving. A comparative analysis is conducted on pre-trained language models (PLMs), GPT-3.5, and Gemini Pro under zero-shot settings using a Retrieval-Augmented Generation (RAG) framework, integrating various reasoning chains. The evaluation yields an F1 score of 0.69, consistency of 0.71, and a faithfulness score of 0.90 on the test dataset.

4/9/2024

cs.CL cs.AI cs.LG

D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

Duygu Altinok

Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs in miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.

5/8/2024

cs.CL