DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

2406.04197

Published 6/7/2024 by Shangqing Tu, Kejian Zhu, Yushi Bai, Zijun Yao, Lei Hou, Juanzi Li

DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

Abstract

The advancement of large language models (LLMs) relies on evaluation using public benchmarks, but data contamination can lead to overestimated performance. Previous researches focus on detecting contamination by determining whether the model has seen the exact same data during training. In this work, we argue that even training on data similar to benchmark data inflates performance on in-distribution tasks without improving overall capacity, which we called In-distribution contamination. To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination. DICE first identifies the most sensitive layer to contamination, then trains a classifier based on the internal states of that layer. Experiments reveal DICE's high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets. We also show the generalization capability of the trained DICE detector, which is able to detect contamination across multiple benchmarks with similar distributions. Additionally, we find that the DICE detection scores are positively correlated with the performance of ten LLMs fine-tuned by either us or other organizations on four math reasoning datasets (with $R^2$ values between 0.6 and 0.75). This indicates that the in-distribution contamination problem potentially lead to an overestimation of the true capabilities of many existing models. The code and data are available at https://github.com/THU-KEG/DICE.

Create account to get full access

Overview

The paper "DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning" explores the issue of data contamination when fine-tuning large language models (LLMs) on specialized tasks, such as math reasoning.
The authors propose a novel technique called DICE (Detecting In-distribution Contamination) to identify and mitigate the impact of in-distribution contamination during the fine-tuning phase.
The research aims to improve the reliability and trustworthiness of LLMs in specialized domains by addressing the challenges posed by data contamination.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can perform a wide range of natural language tasks. However, when these models are fine-tuned on specialized tasks, such as solving math problems, there is a risk of "data contamination." This means that the training data used for fine-tuning may contain information that is already present in the model's pre-training data, which can lead to overfitting and unreliable performance.

The DICE technique proposed in this paper helps to detect and mitigate the impact of this data contamination during the fine-tuning process. By identifying the contaminated data, the researchers can ensure that the fine-tuned model is truly learning the desired task and not just relying on information it already knew from its pre-training. This is important for ensuring the trustworthiness and reliability of LLMs in specialized domains, such as math reasoning.

Technical Explanation

The key components of the DICE technique include:

In-distribution Contamination Detection: The authors develop a method to detect the presence of in-distribution contamination in the fine-tuning data by leveraging the knowledge learned during pre-training.
Contamination Mitigation: Based on the detected contamination, the researchers propose strategies to mitigate its impact, such as selective fine-tuning or adversarial training.

The paper presents experiments on several math reasoning benchmarks, including MATH and GSM8K, to evaluate the effectiveness of the DICE technique. The results demonstrate that DICE can significantly improve the out-of-distribution generalization of fine-tuned LLMs, making them more reliable and trustworthy for specialized tasks.

Critical Analysis

The paper provides a valuable contribution to the field of LLM fine-tuning by addressing the critical issue of data contamination. The DICE technique offers a promising approach to detect and mitigate the impact of in-distribution contamination, which is an important step towards ensuring the trustworthiness and reliability of LLMs in specialized domains.

However, the paper also acknowledges some limitations and areas for further research. For instance, the authors note that DICE may not be able to fully eliminate the impact of contamination, and there may be cases where the contaminated data is not easily identifiable. Additionally, the proposed mitigation strategies, such as selective fine-tuning, may have their own trade-offs that need to be carefully considered.

Future research could explore more advanced techniques for detecting and mitigating data contamination, as well as investigating the impact of contamination on a wider range of specialized tasks beyond math reasoning.

Conclusion

The "DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning" paper presents a novel approach to address the challenge of data contamination in the fine-tuning of large language models. By identifying and mitigating the impact of in-distribution contamination, the DICE technique can help improve the reliability and trustworthiness of LLMs in specialized domains, such as math reasoning. This research is an important step towards ensuring that fine-tuned LLMs can be deployed with confidence in real-world applications that require specialized capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, Ge Li

Recent statements about the impressive capabilities of large language models (LLMs) are usually supported by evaluating on open-access benchmarks. Considering the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leading to LLMs being more susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD necessitates only the sampled texts to detect data contamination, by identifying the peakedness of LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, based on the correction of LLM's output distribution. To facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval, for data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves the average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC metrics, and can effectively detect implicit contamination. TED substantially mitigates performance improvements up to 66.9% attributed to data contamination across various contamination setups. In real-world applications, we reveal that ChatGPT exhibits a high potential to suffer from data contamination on HumanEval benchmark.

6/3/2024

cs.CL cs.AI cs.CR cs.LG cs.SE

Data Contamination Can Cross Language Barriers

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be emph{not even wrong}, as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from url{https://github.com/ShangDataLab/Deep-Contam}.

6/21/2024

cs.CL cs.AI

Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Qin Zhu, Qingyuan Cheng, Runyu Peng, Xiaonan Li, Tengxiao Liu, Ru Peng, Xipeng Qiu, Xuanjing Huang

The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs are achieving increasingly better performance on various benchmarks, their performance in practical applications does not always match their benchmark results. Leakage of benchmarks can prevent the accurate assessment of LLMs' true performance. However, constructing new benchmarks is costly, labor-intensive and still carries the risk of leakage. Therefore, in this paper, we ask the question, Can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulties. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, using Inference-time Decontamination can lead to a decrease in the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.

6/26/2024

cs.CL

📊

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named textbf{T}estset textbf{S}lot Guessing (textit{TS-Guessing}), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.

4/5/2024

cs.CL cs.AI