Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs

Read original: arXiv:2403.19827 - Published 8/13/2024 by Kanishka Misra, Kyle Mahowald

Introduction

The paper investigates how Large Language Models (LLMs) can learn rare grammatical structures, such as the AANN construction ("a beautiful five days in Texas"), from limited input data. The authors hypothesize that LLMs learn these rare phenomena by generalizing from more frequent, related constructions in the input. They train transformer models on systematically manipulated versions of the 100M-word BabyLM corpus to study the extent to which exposure to frequent, related phenomena enables generalization to novel instances of the AANN construction.

The study yields three main findings:

LMs successfully generalize to novel instances of the AANN construction, even when not exposed to any AANNs during training, suggesting that related items in the training data enable non-trivial performance in acceptability judgments.
Systematically removing AANN-related phenomena from the training data, such as measure noun phrases treated as singular, leads to worse performance on predicting novel AANNs, highlighting the role of these phenomena in generalization.
LMs that encounter AANNs with more variability in the adjective, numeral, and noun slots show better generalization than those exposed to more restricted, repeating instances, mirroring findings from human language acquisition and cognitive psychology.

These results demonstrate that a sophisticated statistical learner can learn rare linguistic phenomena by generalizing from key related constructions in the input, without relying on strong innate priors.

General Methods

This section describes the methods used to characterize how LMs learn a rare grammatical construction, the aann (e.g., "a whopping ninety LMs"). The authors train language models (LMs) on the BabyLM-strict corpus and detect aann instances using regular expressions over part-of-speech tags. They train autoregressive transformer LMs with 12 layers and 12 attention heads, and average results over three runs with different random seeds.

To test the LMs' knowledge of the aann, the authors use a dataset containing acceptability ratings for templatically generated sentences. They compare well-formed aann instances to corrupted versions that manipulate adjective-numeral order, article presence, adjective presence, and numeral presence. The Syntactic Log-odds Ratio (SLOR) is used to score sentences, comparing the probability of the construction given the prefix estimated by the LM to that estimated by a unigram model. Accuracy is calculated based on whether the well-formed construction scores higher than all corrupted instances.
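The scoring and accuracy rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the common length-normalized form of SLOR, takes per-token log-probabilities as plain lists (how they are obtained from an LM or a unigram model is left out), and all function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def slor(lm_logprobs: Sequence[float], unigram_logprobs: Sequence[float]) -> float:
    """Length-normalized log-odds of the LM against a unigram baseline.

    Assumption: both sequences hold one log-probability per token of the
    construction, and normalization is by token count.
    """
    if len(lm_logprobs) != len(unigram_logprobs) or not lm_logprobs:
        raise ValueError("need two non-empty, equal-length sequences")
    return (sum(lm_logprobs) - sum(unigram_logprobs)) / len(lm_logprobs)


def is_correct(well_formed: float, corrupted: Sequence[float]) -> bool:
    """An item counts as correct only if the well-formed score beats every corruption."""
    return all(well_formed > c for c in corrupted)


def accuracy(items: Iterable[Tuple[float, Sequence[float]]]) -> float:
    """items: pairs of (well-formed score, scores of its corrupted variants)."""
    items = list(items)
    return sum(is_correct(w, cs) for w, cs in items) / len(items)
```

Subtracting the unigram term keeps a sentence from scoring poorly just because it contains infrequent words, and dividing by length makes variants of different lengths (for example, the one with the article removed) comparable.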

In subsequent experiments, the authors ablate parts of the BabyLM corpus that conform to certain linguistic or statistical hypotheses. To maintain the same quantity of training data, they up-sample non-hypothesis-conforming utterances after ablation. This allows them to compare LMs that differ in content but not in the total number of tokens encountered during training.
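A rough sketch of this ablate-then-up-sample step is shown below. It is only an illustration under stated assumptions: tokens are counted by whitespace splitting, replacement utterances are drawn uniformly at random from the surviving ones, and the function name and `matches` predicate are hypothetical rather than taken from the paper.

```python
import random
from typing import Callable, List


def ablate_and_upsample(
    utterances: List[str],
    matches: Callable[[str], bool],
    seed: int = 0,
) -> List[str]:
    """Drop utterances that match a hypothesis, then refill to the original size."""
    rng = random.Random(seed)
    # Token budget of the original corpus (whitespace tokens, an assumption).
    target = sum(len(u.split()) for u in utterances)
    kept = [u for u in utterances if not matches(u)]
    if not kept:
        raise ValueError("ablation removed every utterance")
    out = list(kept)
    total = sum(len(u.split()) for u in out)
    # Re-sample surviving utterances until the token budget is met again.
    while total < target:
        u = rng.choice(kept)
        out.append(u)
        total += len(u.split())
    return out
```

Because whole utterances are added, the final count can overshoot the target by at most one utterance's worth of tokens.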

Experiment 1: LMs learn about aanns without having seen a single instance

The paper investigated the extent to which language models (LMs) trained on the BabyLM corpus learn the "aann" construction (e.g., "a fine eighteen months"). The LMs achieved accuracies around 70%, substantially above chance, even though positive evidence of the construction made up only 0.02% of their training data. Larger state-of-the-art LMs like Llama-2-7B and GPT-2 XL achieved even higher accuracies of 83% and 78%, respectively.

Interestingly, LMs trained on the BabyLM corpus with all 2,301 detected "aann" instances removed still achieved an accuracy of 54% on judging the acceptability of "aann" constructions, 47.75 points above the 6.25% chance level (a well-formed item must outscore all four corrupted variants, so random guessing succeeds only 0.5^4 of the time). This suggests LMs can learn the acceptability of a construction's instances without seeing any positive occurrences, likely driven by systematic patterns in the corpus.

The paper also tested counterfactual variants that violate English grammar, such as "anan" and "naan". LMs trained on these variants did not learn them as well as "aann", and still assigned non-trivial probability to unseen "aann" instances. This implies LMs pick up cues from related constructions to generalize to novel "aann" examples.

Experiment 2: Keys to Learning aanns

The paper investigates the aann construction in language models (LMs) and hypothesizes four phenomena that may contribute to its learning, despite the construction being rare in training data:

Phrases like "the beautiful five days" where "the" takes a plural noun
Measure noun phrases with plural nouns attached to an indefinite article (e.g., "a few days")
Measure nouns treated as singular in terms of agreement (e.g., "Five miles is a long way to go")
The higher likelihood of adjectives following indefinite articles compared to numerals

The effect of these phenomena on aann acceptability is measured by holding out their instances during training and comparing the resulting SLOR values. A control condition with random removal of instances is also considered.

Experiments are conducted under two settings: with aanns removed during training along with the phenomena, and with aanns seen during training when possible. Results show that holding out the hypothesized phenomena has non-trivial effects on LMs' ratings of unseen well-formed aanns, with balancing the frequency of adjectives and numerals following an article having the greatest effect. These patterns are absent in 4-gram LMs, suggesting they do not arise from shallow surface statistics.

The paper concludes that when LMs see evidence of the aann construction, they do learn from it. However, related phenomena where measure nouns are treated as singular show notable effects even when aanns are present, indicating they enable additional learning.

Experiment 3: The Role of Variability

The paper investigates how the variability of open slots in a construction affects language models' ability to generalize to unseen instances of that construction, focusing on the aann (Article-Adjective-Numeral-Noun) construction. The authors hypothesize that aann instances with greater open-slot variability, i.e., evidence that many different adjectives, numerals, and nouns can fill their respective positions, would lead language models to assign greater likelihood to unseen aanns.

The experiment divided aann-containing utterances from the BabyLM corpus into two subsets: one with highly frequent but restricted slot-fillers, and another with less frequent but more variable slot-fillers. Language models were trained on the BabyLM corpus containing either of these subsets, and the results were compared to models trained on the unablated BabyLM and a condition with no aanns.

The findings showed that language models exposed to aanns with highly variable open slots demonstrated slot fill-in likelihoods comparable to or greater than models trained on all aanns. In contrast, models exposed to aanns with low variability performed similarly to models that never saw any aanns. These results support the hypothesis that slot-variability affects the extent to which language models permit productive uses of a construction.
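To make "slot variability" concrete, the sketch below computes a simple per-slot type-token ratio over a set of aann instances. This is an illustrative measure only: the summary does not say how the authors defined or split the high- and low-variability subsets, and the function name and toy examples are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# One aann instance, reduced to its three open slots.
Instance = Tuple[str, str, str]  # (adjective, numeral, noun)


def slot_variability(instances: Iterable[Instance]) -> Dict[str, float]:
    """Distinct fillers per slot divided by the number of instances (1.0 = all different)."""
    instances = list(instances)
    if not instances:
        raise ValueError("no instances given")
    slots = ("adjective", "numeral", "noun")
    return {
        name: len({inst[i] for inst in instances}) / len(instances)
        for i, name in enumerate(slots)
    }


# Toy subsets: one repeats the same fillers, the other varies every slot.
low = [("good", "few", "days")] * 3 + [("good", "two", "days")]
high = [
    ("beautiful", "five", "days"),
    ("whopping", "ninety", "pages"),
    ("fine", "eighteen", "months"),
    ("mere", "three", "hours"),
]
```

On these toy sets the repetitive subset scores low on every slot while the varied subset scores 1.0 throughout, which is the contrast the experiment manipulates.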

Figure 4: SLORs on aanns from Mahowald (2023) for LMs trained on BabyLM with low and high variability in the observed aann instances. The SLOR for the LM trained on unablated BabyLM is shown as a dotted line.

Conclusion

The paper explores how language models handle rare linguistic phenomena, often referred to as the "long tail" of language. Studying these phenomena is important for two reasons: language models perform better with more data, so rare constructions are a hard test case, and the human ability to generalize to rare constructions is central to language knowledge. The authors found that language models trained on human-scale data can learn a rare construction called the aann, even without direct examples in the training data. This learning is mediated by occurrences of related constructions during training. The results contribute to a growing body of research demonstrating the ability of large language models to learn linguistic constructions.

Limitations

The paper discusses potential future work and limitations of the current method. Extending the method to a wider range of constructions is valuable but not straightforward, as it requires identifying idiosyncratic constructions and developing testable hypotheses about their learnability from limited data. This limitation highlights the need for collaboration between theoretical and computational linguists. Another limitation is the computational expense of repeatedly training language models from scratch. Alternative methods, such as representational editing, could be explored. The paper focuses on linguistic form rather than testing the ability to interpret constructions for downstream semantic tasks, which would be an informative extension.

Acknowledgments

The authors acknowledge funding from NSF Grant 2104995 awarded to Kyle Mahowald. They thank Adele Goldberg, Leonie Weissweiler, the computational linguistics research group at UT Austin, the syntax-semantics research group at UT Austin, and the audience at the Texas Linguistics Society meeting for helpful conversations. They also thank Chris Potts for his paper on the PiPP construction, which inspired the "keys to all of this" idea in their own work.

Appendix A LM training details

The authors train language models using the OPT architecture on various versions of the BabyLM corpus. They tune the learning rate for each instance of the corpus based on the validation set, and then train two additional language models with different seeds using the best learning rate. In total, they train 6 language models for each ablation of the BabyLM corpus, resulting in 90 language models for all experiments. Table 3 provides more details about the training process.

Appendix B Detecting aanns and related phenomena

The paper describes methods to extract constructions and phenomena from the BabyLM corpus. The methods primarily rely on the surface form of sentences, part-of-speech (POS) tag sequences, and in some cases, dependency parses. The authors used the spaCy library with the en_core_web_trf model, which is based on RoBERTa-base, for POS tagging and parsing.

To detect AANNs (Article-Adjective-Numeral-Noun), the authors constructed a regex pattern over POS-tagged sequences. The regex allows for multiple adjectives, optional adverbs, multi-word noun phrases with plural head-nouns, and numeral expressions. They also used adjectives like 'few', 'dozen', 'couple', 'several', 'many', and 'more' as proxies for numerals.

For DT ANNs (Determiner-Adjective-Numeral-Noun, e.g., "the beautiful five days"), the same procedure as for AANNs was followed, without restricting the determiner position to indefinite articles.
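The idea of running a regex over POS tags can be sketched as follows. This is a simplified illustration, not the authors' pattern: it assumes Penn Treebank style tags (the kind spaCy exposes as fine-grained tags), maps each token to a single letter, and takes pre-tagged input so that no tagger is needed to run it. The letter codes, patterns, and function names are all hypothetical.

```python
import re
from typing import List, Tuple

# Quantity words the summary says were treated as numeral proxies.
NUMERAL_PROXIES = {"few", "dozen", "couple", "several", "many", "more"}


def encode(tagged: List[Tuple[str, str]]) -> str:
    """Map each (word, tag) pair to one letter so a regex can scan the tag sequence."""
    codes = []
    for word, tag in tagged:
        w = word.lower()
        if w in {"a", "an"}:
            codes.append("D")  # indefinite article
        elif tag == "DT":
            codes.append("T")  # any other determiner
        elif tag == "CD" or w in NUMERAL_PROXIES:
            codes.append("C")  # numeral or numeral proxy
        elif tag.startswith("JJ"):
            codes.append("J")  # adjective
        elif tag.startswith("RB"):
            codes.append("R")  # adverb
        elif tag in {"NNS", "NNPS"}:
            codes.append("P")  # plural noun
        elif tag.startswith("NN"):
            codes.append("N")  # other noun
        else:
            codes.append("O")  # anything else
    return "".join(codes)


# Article, one or more (optionally adverb-modified) adjectives, numerals,
# optional nouns of a multi-word noun phrase, then a plural head noun.
AANN = re.compile(r"D(?:R?J)+C+N*P")
# Same shape, but any determiner is allowed in first position.
DT_ANN = re.compile(r"[DT](?:R?J)+C+N*P")


def find_spans(tagged: List[Tuple[str, str]], pattern: re.Pattern = AANN) -> List[Tuple[int, int]]:
    """Return (start, end) token spans; one letter per token keeps indices aligned."""
    return [m.span() for m in pattern.finditer(encode(tagged))]
```

Because each token becomes exactly one character, match offsets in the encoded string are token indices, so matched spans can be mapped straight back to the utterance.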

The authors also considered cases where plural nouns are attached to an indefinite article, such as "a few days" or "a couple liters". These cases were detected using dependency configurations involving det, amod, quantmod, and nummod relations.

Lastly, they examined measure noun-phrases with plural nouns treated as singular via agreement with a verb, like "five dollars is plenty". Such cases were detected using dependency configurations involving nummod and nsubj relations.

Appendix C A/An + ADJ/NUM frequency balancing

An analysis of the POS-tagged BabyLM corpus shows that adjectives are about 14.6 times more likely than numerals to follow an indefinite article. To balance the two, the authors remove 571,874 instances of adjectives following an indefinite article, making this the largest ablation performed in the study.
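One plausible reading of this balancing step is that article+adjective instances are removed until their count matches the article+numeral count. The sketch below illustrates that reading with toy numbers chosen to give the same 14.6 ratio. Both the interpretation and the function names are assumptions, and the counts are not the paper's.

```python
import random
from typing import List


def removals_to_balance(n_article_adj: int, n_article_num: int) -> int:
    """How many article+adjective instances to drop so the two counts match."""
    return max(0, n_article_adj - n_article_num)


def pick_removals(adj_ids: List[int], n_article_num: int, seed: int = 0) -> List[int]:
    """Randomly choose which article+adjective instances to drop (uniform sampling assumed)."""
    k = removals_to_balance(len(adj_ids), n_article_num)
    return sorted(random.Random(seed).sample(adj_ids, k))


# Toy counts only: 146 adjective cases vs. 10 numeral cases is a 14.6x ratio.
to_drop = pick_removals(list(range(146)), 10)
```

With these toy counts, 136 of the 146 adjective instances are dropped, which shows why a large ratio forces a large ablation.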


Original Abstract

Language models learn rare syntactic phenomena, but the extent to which this is attributable to generalization vs. memorization is a major open question. To that end, we iteratively trained transformer language models on systematically manipulated corpora which were human-scale in size, and then evaluated their learning of a rare grammatical phenomenon: the English Article+Adjective+Numeral+Noun (AANN) construction ("a beautiful five days"). We compared how well this construction was learned on the default corpus relative to a counterfactual corpus in which AANN sentences were removed. We found that AANNs were still learned better than systematically perturbed variants of the construction. Using additional counterfactual corpora, we suggest that this learning occurs through generalization from related constructions (e.g., "a few days"). An additional experiment showed that this learning is enhanced when there is more variability in the input. Taken together, our results provide an existence proof that LMs can learn rare grammatical phenomena by generalization from less rare phenomena. Data and code: https://github.com/kanishkamisra/aannalysis.