MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference

Read original: arXiv:2404.08066 - Published 4/15/2024 by Mobashir Sadat, Cornelia Caragea

Overview

Describes a new dataset called MSciNLI for the task of scientific Natural Language Inference (NLI)
MSciNLI contains 132,320 sentence pairs from 5 new scientific domains, expanding on the existing SciNLI dataset
Establishes strong baselines using fine-tuned Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs)
Shows that MSciNLI is challenging for both types of models, and that domain shift degrades performance

Plain English Explanation

The paper introduces a new dataset called MSciNLI for the task of scientific Natural Language Inference (NLI). NLI is the task of predicting the semantic relationship between two sentences, such as whether one sentence entails, contradicts, or is neutral with respect to another.

The researchers created MSciNLI by extracting 132,320 sentence pairs from research papers across 5 new scientific domains, expanding on the existing SciNLI dataset, which was limited to the computational linguistics domain. Covering several domains makes it possible to study how models perform when they move from one scientific field to another.

The researchers then established strong baseline performance on MSciNLI using two types of models: fine-tuned Pre-trained Language Models (PLMs) and prompted Large Language Models (LLMs). They found that MSciNLI is challenging for both types of models, with the best PLM and LLM baselines achieving Macro F1 scores of 77.21% and 51.77% respectively.

Furthermore, the researchers showed that a model's performance degrades when applied to a different scientific domain, demonstrating the diverse characteristics of the domains represented in MSciNLI. Finally, they showed that using MSciNLI and SciNLI in an intermediate task transfer learning setting can improve performance on downstream tasks in the scientific domain.

Technical Explanation

The paper introduces MSciNLI, a dataset for scientific Natural Language Inference (NLI). The task is to predict the semantic relation between two sentences extracted from research articles. Following SciNLI, each pair is assigned one of four relations: entailment, reasoning, contrasting, or neutral.

The researchers created MSciNLI by extracting 132,320 sentence pairs from research papers across 5 new scientific domains: Hardware, Networks, Software & its Engineering, Security & Privacy, and NeurIPS. This expands on the existing SciNLI dataset, which was limited to the computational linguistics domain.

The researchers established strong baseline performance on MSciNLI using two types of models:

Pre-trained Language Models (PLMs): The researchers fine-tuned PLMs such as BERT, SciBERT, RoBERTa, and XLNet on the MSciNLI dataset. The best PLM baseline reached a Macro F1 score of 77.21%.
Large Language Models (LLMs): The researchers prompted LLMs such as Llama-2 and Mistral on the MSciNLI task. The best LLM baseline reached a Macro F1 score of 51.77%.
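
To make the prompting setup more concrete, here is a minimal sketch of how a few-shot classification prompt for one sentence pair could be assembled and how a model's free-form answer could be mapped back to a label. It is not the prompt used in the paper: the template wording and the names `SentencePair`, `build_prompt`, and `parse_label` are hypothetical, and the label set assumes the four relation classes of the SciNLI family of datasets.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumption: the four relation classes of SciNLI-style datasets.
LABELS = ("entailment", "reasoning", "contrasting", "neutral")


@dataclass(frozen=True)
class SentencePair:
    """One scientific NLI example; label is None for the pair to classify."""
    first: str
    second: str
    label: Optional[str] = None


def build_prompt(query: SentencePair,
                 demonstrations: Sequence[SentencePair] = ()) -> str:
    """Build a zero-shot (no demonstrations) or few-shot prompt string."""
    lines = [
        "Classify the relation between two sentences from a research paper.",
        "Answer with one of: " + ", ".join(LABELS) + ".",
        "",
    ]
    for demo in demonstrations:
        if demo.label not in LABELS:
            raise ValueError(f"demonstration has unknown label: {demo.label!r}")
        lines += [
            f"Sentence 1: {demo.first}",
            f"Sentence 2: {demo.second}",
            f"Relation: {demo.label}",
            "",
        ]
    # The query pair ends with an open "Relation:" slot for the model to fill.
    lines += [
        f"Sentence 1: {query.first}",
        f"Sentence 2: {query.second}",
        "Relation:",
    ]
    return "\n".join(lines)


def parse_label(completion: str) -> Optional[str]:
    """Map the start of a model completion to a label, or None if unmatched."""
    text = completion.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return None
```

Returning `None` for unparseable completions keeps them visible, so a caller can count them as errors instead of silently assigning a default class.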

The results demonstrate that MSciNLI is a challenging dataset for both types of models, illustrating the diversity and complexity of the scientific NLI task.
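
Both headline numbers are Macro F1 scores: the F1 score is computed separately for each class and the per-class values are averaged with equal weight, so a class with few examples counts as much as a frequent one. The following is a small sketch of that computation in plain Python; the function name `macro_f1` and the toy labels are illustrative and are not taken from the paper's code.

```python
from collections import Counter
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # missed an instance of class g

    scores = []
    for label in sorted(set(gold) | set(pred)):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels (not data from MSciNLI).
gold = ["entailment", "entailment", "neutral", "contrasting"]
pred = ["entailment", "neutral", "neutral", "contrasting"]
print(f"Macro F1: {macro_f1(gold, pred):.4f}")
```

In the toy example, the per-class F1 values are 2/3 for entailment, 2/3 for neutral, and 1 for contrasting, so the macro average is 7/9.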

Furthermore, the researchers showed that domain shift degrades the performance of scientific NLI models, indicating the diverse characteristics of the different scientific domains represented in the dataset.
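
One common way to quantify domain shift is to train a model on each domain, test it on every domain, and compare the in-domain score with the scores obtained when training on other domains. The sketch below assumes such a train-by-test matrix of Macro F1 scores is already available; the function name `domain_shift_gaps`, the domain names, and the numbers are placeholders invented for illustration and are not results from the paper.

```python
from typing import Dict

ScoreMatrix = Dict[str, Dict[str, float]]  # scores[train_domain][test_domain]


def domain_shift_gaps(scores: ScoreMatrix) -> Dict[str, float]:
    """For each test domain, return in-domain score minus mean out-of-domain score.

    A positive gap means models trained on other domains do worse on this
    test set than a model trained on the matching domain.
    """
    if len(scores) < 2:
        raise ValueError("need at least two domains to measure domain shift")

    gaps = {}
    for test_domain in scores:
        in_domain = scores[test_domain][test_domain]
        out_of_domain = [
            scores[train_domain][test_domain]
            for train_domain in scores
            if train_domain != test_domain
        ]
        gaps[test_domain] = in_domain - sum(out_of_domain) / len(out_of_domain)
    return gaps


# Placeholder numbers for three hypothetical domains (not from the paper).
toy_scores = {
    "domain_a": {"domain_a": 0.80, "domain_b": 0.70, "domain_c": 0.72},
    "domain_b": {"domain_a": 0.71, "domain_b": 0.79, "domain_c": 0.70},
    "domain_c": {"domain_a": 0.73, "domain_b": 0.69, "domain_c": 0.81},
}
for domain, gap in domain_shift_gaps(toy_scores).items():
    print(f"{domain}: in-domain advantage = {gap:+.3f}")
```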

Finally, the researchers used both the SciNLI and MSciNLI datasets in an intermediate task transfer learning setting, and showed that this can improve the performance of downstream tasks in the scientific domain.

Critical Analysis

The paper presents a valuable contribution to the field of scientific NLI by introducing the MSciNLI dataset, which expands the diversity of domains beyond the existing SciNLI dataset. This allows for the study of domain shift and the development of more robust models that can generalize across scientific fields.

However, the paper does not delve into potential limitations or caveats of the dataset or the baseline models. For example, it would be helpful to understand the potential biases or skewed distributions within the dataset, and how that might impact model performance.

Additionally, the paper could have discussed potential challenges in annotating the sentence pairs or in deriving the dataset from research papers. Understanding these challenges could provide insights into improving the dataset or the NLI task itself.

Finally, the paper could have explored more avenues for further research, such as investigating the use of domain-specific knowledge or multimodal information (e.g., incorporating figures and tables) to improve scientific NLI performance.

Conclusion

The paper introduces a new dataset called MSciNLI for the task of scientific Natural Language Inference (NLI), which expands on the existing SciNLI dataset by including sentence pairs from 5 new scientific domains. The researchers establish strong baselines using fine-tuned Pre-trained Language Models and prompted Large Language Models, demonstrating the challenging nature of the task.

The paper's key contribution is the expansion of the scientific NLI task to a more diverse set of domains, enabling the study of domain shift and the development of more robust models. The findings suggest that scientific NLI remains a challenging problem, and that further research is needed to improve model performance and generalization across scientific fields.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference

Mobashir Sadat, Cornelia Caragea

The task of scientific Natural Language Inference (NLI) involves predicting the semantic relation between two sentences extracted from research articles. This task was recently proposed along with a new dataset called SciNLI derived from papers published in the computational linguistics domain. In this paper, we aim to introduce diversity in the scientific NLI task and present MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains. The availability of multiple domains makes it possible to study domain shift for scientific NLI. We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs). The highest Macro F1 scores of PLM and LLM baselines are 77.21% and 51.77%, respectively, illustrating that MSciNLI is challenging for both types of models. Furthermore, we show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset. Finally, we use both scientific NLI datasets in an intermediate task transfer learning setting and show that they can improve the performance of downstream tasks in the scientific domain. We make our dataset and code available on Github.

4/15/2024

A synthetic data approach for domain generalization of NLI models

Mohammad Javad Hosseini, Andrey Petrov, Alex Fabrikant, Annie Louis

Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We explore the opportunity for synthetic high-quality datasets to adapt NLI models for zero-shot use in downstream applications across new and unseen text domains. We demonstrate a new approach for generating NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data (685K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around 7% on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.

7/1/2024

Co-training for Low Resource Scientific Natural Language Inference

Mobashir Sadat, Cornelia Caragea

Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. The automatic annotation method based on distant supervision for the training set of SciNLI (Sadat and Caragea, 2022b), the first and most popular dataset for this task, results in label noise which inevitably degenerates the performance of classifiers. In this paper, we propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels, reflective of the manner they are used in the subsequent training epochs. That is, unlike the existing semi-supervised learning (SSL) approaches, we consider the historical behavior of the classifiers to evaluate the quality of the automatically annotated labels. Furthermore, by assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data, while ensuring that the noisy labels have a minimal impact on model training. The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines. We make our code and data available on Github.

6/24/2024

Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding

Balaji Muralidharan, Hayden Beadles, Reza Marzban, Kalyan Sashank Mupparaju

This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains and to create a deep learning framework: Knowledge AI. As a part of this framework, we employ pre-trained models and fine-tune them on datasets in the scientific domain. The models are adapted for four key Natural Language Processing (NLP) tasks: summarization, text generation, question answering, and named entity recognition. Our results indicate that domain-specific fine-tuning significantly enhances model performance in each of these tasks, thereby improving their applicability for scientific contexts. This adaptation enables non-experts to efficiently query and extract information within targeted scientific fields, demonstrating the potential of fine-tuned LLMs as a tool for knowledge discovery in the sciences.

8/12/2024