Stronger Random Baselines for In-Context Learning

Read original: arXiv:2404.13020 - Published 4/22/2024 by Gregory Yauney, David Mimno

Overview

Evaluating the in-context learning performance of language models on text classification is challenging because datasets are small, the validation set is reused extensively for prompt selection, and some tasks are intentionally difficult enough to produce near-random performance.
The standard random baseline (the expected accuracy of guessing labels uniformly at random) is stable only when the evaluation set is used once or the dataset is large, so it is a weak point of comparison for small datasets and reused validation sets.
The paper proposes a stronger random baseline, the expected maximum accuracy across multiple random classifiers, to better assess language model performance.

Plain English Explanation

The paper discusses the challenges of evaluating how well language models can learn to classify text. Classifying text, like deciding if an email is spam or not, is an important task for language models. But testing how well they do this is tricky.

The typical way to test is to have the model classify a set of text samples and measure how accurate it is. The "random baseline" is the accuracy you would expect from guessing the labels uniformly at random. This baseline is stable when the test set is large or is used only once.

However, the researchers found that this random baseline isn't always a good measure, especially when the test dataset is small or has been used a lot during development. In these cases, the random baseline can be too low, making the language model look better than it really is.

To fix this, the researchers propose a "stronger random baseline" - the expected best accuracy if you had multiple random classifiers and picked the most accurate one. This provides a more realistic baseline to compare language models against, especially for small datasets or heavily-used validation sets. This stronger baseline helps avoid incorrectly thinking a language model has learned something significant when it's really just doing a bit better than pure guessing.
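The idea of "picking the most accurate of several random classifiers" can be illustrated with a small simulation. This is a minimal sketch, not the paper's code: the function name `simulated_max_accuracy`, the dataset size of 50, the two-class setup, and the trial count are all assumptions chosen for illustration.

```python
import random

def simulated_max_accuracy(n_examples, n_classes, n_classifiers, n_trials=300, seed=0):
    """Monte Carlo estimate of the expected best accuracy among random guessers.

    Hypothetical helper: each guess is assumed to be correct independently
    with probability 1 / n_classes (uniform guessing).
    """
    rng = random.Random(seed)
    p_correct = 1.0 / n_classes
    total = 0.0
    for _ in range(n_trials):
        # Number of correct guesses for each random classifier; keep the best one.
        best_correct = max(
            sum(rng.random() < p_correct for _ in range(n_examples))
            for _ in range(n_classifiers)
        )
        total += best_correct / n_examples
    return total / n_trials

# Illustrative settings (assumed, not from the paper): 50 examples, 2 classes.
estimates = {t: simulated_max_accuracy(50, 2, t) for t in (1, 10, 100)}
for t, acc in estimates.items():
    print(f"best of {t:>3} random classifiers: {acc:.3f}")
```

With a single random classifier the estimate sits near the standard baseline of 1/2, and it rises as more random classifiers are compared on the same small dataset.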

Technical Explanation

The paper examines the challenges of assessing the "in-context learning" performance of language models on text classification tasks. In-context learning refers to a language model's ability to perform a new task from just a few examples given in the prompt, without fine-tuning.

The researchers identify three key issues that complicate this evaluation:

Small dataset sizes, which make it hard to get reliable performance estimates
Extensive use of validation sets to choose the best prompt demonstrations, which can lead to overfitting
Intentionally difficult tasks that result in near-random performance
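The second issue, overfitting through validation set reuse, can be sketched with a simulation in which every "prompt" is really just a random guesser. This is an illustrative toy under stated assumptions, not the paper's experimental setup: the function `selection_gap`, the set sizes, and the number of prompts are hypothetical.

```python
import random

def selection_gap(n_val, n_test, n_classes, n_prompts, n_trials=1000, seed=0):
    """Average validation accuracy of the selected prompt vs. its test accuracy.

    Assumption: every candidate prompt behaves like a uniform random guesser,
    so any validation advantage of the selected prompt is pure selection noise.
    """
    rng = random.Random(seed)
    p_correct = 1.0 / n_classes
    val_total = 0.0
    test_total = 0.0
    for _ in range(n_trials):
        # Validation accuracy of each uninformative prompt on the same set.
        val_accs = [
            sum(rng.random() < p_correct for _ in range(n_val)) / n_val
            for _ in range(n_prompts)
        ]
        val_total += max(val_accs)  # accuracy of the prompt we would select
        # The selected prompt is still a random guesser on fresh test data.
        test_total += sum(rng.random() < p_correct for _ in range(n_test)) / n_test
    return val_total / n_trials, test_total / n_trials

# Illustrative settings (assumed): 50 validation and 50 test examples, 10 prompts.
val_acc, test_acc = selection_gap(n_val=50, n_test=50, n_classes=2, n_prompts=10)
print(f"selected prompt, validation accuracy: {val_acc:.3f}")
print(f"selected prompt, held-out accuracy:   {test_acc:.3f}")
```

The selected prompt looks better than chance on the validation set it was chosen on, while its held-out accuracy stays near chance.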

To address these issues, the paper proposes using a "maximum random baseline" instead of the standard random baseline. This baseline represents the expected accuracy of the best-performing random classifier out of multiple trials, rather than just a single random classifier.
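One way to write this quantity down, assuming each of t independent random classifiers guesses every one of n examples correctly with probability p (p = 1/c for c equally likely labels), is the standard identity for the expected maximum of independent binomial counts. The symbols n, t, p, c and F are notation introduced here for illustration and may differ from the paper's.

```latex
% Assumptions: n evaluation examples, t independent random classifiers,
% each guess correct with probability p.
\[
X_i \sim \mathrm{Binomial}(n, p), \qquad
F(k) = \Pr[X_i \le k] = \sum_{j=0}^{k} \binom{n}{j} p^{j} (1-p)^{n-j}, \qquad
F(-1) = 0
\]
\[
\mathbb{E}\!\left[ \max_{1 \le i \le t} \frac{X_i}{n} \right]
  = \frac{1}{n} \sum_{k=0}^{n} k \left( F(k)^{t} - F(k-1)^{t} \right)
\]
```

For t = 1 the sum reduces to p, the standard random baseline; for t > 1 it is strictly larger whenever 0 < p < 1.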

The researchers show that, when held-out test sets are available, this maximum random baseline is a better predictor of held-out test performance than the standard baseline. When choosing the best prompt demonstrations across six quantized language models applied to 16 BIG-bench Lite tasks, they find that more than 20% of the few-shot results that exceed the standard random baseline do not exceed the stronger maximum random baseline.

The maximum random baseline provides a straightforward, drop-in replacement for the standard random baseline that better accounts for common evaluation practices and dataset limitations. This helps avoid overestimating the in-context learning capabilities of language models.
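As a sketch of what such a drop-in calculation could look like, the expected maximum of independent binomial counts can be computed exactly from the binomial CDF. This is not the authors' implementation: the function names, the independence assumption, and the scenario numbers below (50 examples, 10 prompts tried, an observed best accuracy of 0.58) are hypothetical.

```python
from math import exp, lgamma, log

def binomial_pmf(n, k, p):
    """P[X = k] for X ~ Binomial(n, p), computed in log space for stability."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def max_random_baseline(n_examples, p_correct, n_classifiers):
    """Expected best accuracy among n_classifiers independent random guessers.

    Assumes 0 < p_correct < 1 and independent guesses on every example.
    """
    expected_max_correct = 0.0
    cdf_prev = 0.0  # F(k - 1), starting from F(-1) = 0
    for k in range(n_examples + 1):
        cdf = min(1.0, cdf_prev + binomial_pmf(n_examples, k, p_correct))
        # P[max = k] = F(k)^t - F(k - 1)^t for t independent classifiers.
        expected_max_correct += k * (cdf**n_classifiers - cdf_prev**n_classifiers)
        cdf_prev = cdf
    return expected_max_correct / n_examples

# Hypothetical scenario (numbers invented for illustration only).
n_examples, n_classes, n_prompts = 50, 2, 10
observed_best_accuracy = 0.58

standard_baseline = 1 / n_classes
stronger_baseline = max_random_baseline(n_examples, 1 / n_classes, n_prompts)

print(f"standard random baseline: {standard_baseline:.3f}")
print(f"maximum random baseline:  {stronger_baseline:.3f}")
print("beats standard baseline:", observed_best_accuracy > standard_baseline)
print("beats maximum random baseline:", observed_best_accuracy > stronger_baseline)
```

Setting `n_classifiers` to the number of prompts (or other configurations) scored on the same validation set is the design choice that ties the baseline to how often that set was reused; with `n_classifiers=1` the function returns the standard baseline.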

Critical Analysis

The paper makes a compelling case for using a stronger random baseline to more accurately assess language model performance on text classification tasks. The proposed maximum random baseline addresses several well-known issues with standard evaluation practices, such as small datasets and validation set reuse.

One potential limitation is that the maximum random baseline may not generalize well to tasks or datasets that differ significantly from those examined in the paper. The researchers tested it on a specific set of BIG-bench Lite tasks, so further validation on a wider range of benchmarks would help establish its broader applicability.

Additionally, the paper does not explore how the maximum random baseline might interact with other proposed evaluation methodologies, such as contrast sets or uncertainty quantification. Combining these approaches could lead to even more robust and informative language model evaluations.

Overall, the maximum random baseline is a well-motivated and practical solution to a common issue in language model benchmarking. Adopting this baseline could lead to more realistic assessments of in-context learning capabilities and help avoid overconfidence in the current state of the art.

Conclusion

This paper presents a new approach to evaluating the in-context learning performance of language models on text classification tasks. By proposing a "maximum random baseline" that accounts for small datasets and validation set reuse, the researchers offer a more reliable way to assess the true capabilities of these models.

Adopting this baseline could have important implications for the field of natural language processing, leading to more accurate assessments of language model progress and avoiding the pitfalls of overly optimistic performance claims. As the capabilities of these models continue to advance, robust and reliable evaluation methods will be crucial for driving the field forward in a responsible and meaningful way.

