Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities

Read original: arXiv:2406.10851 - Published 6/18/2024 by Byung-Doh Oh, William Schuler

Overview

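This paper identifies a confound in how word probabilities are calculated from language models: most tokens in the subword vocabulary have a leading whitespace, so they do not naturally define the probability that a word ends.
The authors prove that this can yield word probabilities that sum to more than one, and that it misallocates word-by-word surprisal by carrying the cost of an unexpected end of word over to the next word.
The paper presents a simple decoding technique that reassigns the probability of the trailing whitespace to the current word, and a case study on garden-path effects showing that the correction changes the estimates.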

Plain English Explanation

Language models are artificial intelligence systems that are trained on vast amounts of text data to learn the patterns and structure of human language. These models are then used in a variety of applications, such as text generation, translation, and question answering.

One key component of language models is the subword vocabulary, which is a set of smaller units that the model uses to represent words. This allows the model to handle a wide range of vocabulary, including rare or complex words, without the need to store every possible word in its memory.

However, the authors of this paper point out a problem with the way most subword vocabularies represent spaces. The space that separates two words is attached to the beginning of the following token, so the vocabulary contains a token like " cat" rather than "cat". As a result, nothing in a word's own tokens says that the word has ended. This creates a confound, or a source of systematic error, in the way word probabilities are calculated from the model.

To illustrate this, imagine scoring the word "cat" in the middle of a sentence. The probability of the token " cat" also covers any longer word that begins with the same token, such as "cats" when it is split into " cat" and "s", and the model only signals that the word is over when it predicts a next token with a leading space. The cost of an unexpected word ending is therefore charged to the next word instead of the current one. This matters for studies that use word probabilities to evaluate a model's predictions or to model how difficult a sentence is for human readers to process.

The paper provides a technical analysis of this issue, including a proof that the resulting word probabilities can sum to more than one. It then presents a simple decoding technique that moves the probability of the trailing whitespace into the probability of the current word. The authors report that this resolves the confound and, in a case study, significantly changes estimates of garden-path effects.

Technical Explanation

The paper investigates a confound that arises from the presence of leading whitespaces in the subword vocabulary of language models. Subword vocabularies are commonly used in modern language models to handle a wide range of vocabulary, including rare and complex words, without the need to store every possible word in the model's memory.

However, the authors observe that tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words, that is, the probability that a word ends after a given token. They prove that this can result in word probabilities that sum to more than one, violating the axiom that $\mathsf{P}(\Omega) = 1$.
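As a rough illustration of what a leading-whitespace vocabulary looks like, the sketch below tokenizes text with a tiny made-up vocabulary. The vocabulary, the `tokenize` helper, and the greedy longest-match strategy are all assumptions for illustration (real tokenizers such as BPE apply learned merge rules instead); the "Ġ" symbol follows the GPT-2 convention for marking a leading space.

```python
# Hypothetical toy vocabulary. "Ġ" marks a leading space.
VOCAB = {"Ġthe", "Ġcat", "Ġsat", "s"}

def tokenize(text, vocab):
    """Greedy longest-match tokenization (a simplification, not real BPE)."""
    # Attach each space to the start of the word that follows it.
    marked = "Ġ" + text.strip().replace(" ", "Ġ")
    tokens, i = [], 0
    while i < len(marked):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab:
                tokens.append(marked[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {marked[i:]!r}")
    return tokens

print(tokenize("the cat", VOCAB))
print(tokenize("the cats", VOCAB))
```

In this toy setup the token sequence for "the cat" is a prefix of the sequence for "the cats", so the tokens alone cannot say whether the word "cat" has ended.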

The practical consequence is a misallocation of word-by-word surprisal (the negative log probability of a word): the unacceptability of the current end of word is incorrectly carried over to the next word. The authors also note that this implicit prediction of word boundaries is incongruous with psycholinguistic experiments, where human subjects directly observe upcoming word boundaries. As a case study, they show that correcting for the confound gives significantly different estimates of garden-path effects in transitive/intransitive sentences, where a comma is strongly expected before the critical word.
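Surprisal, the negative log probability of a word, can be measured in bits using a base-2 logarithm. The sketch below uses two made-up probabilities for the same word (one that ignores the following word boundary and one that includes it) to show how the choice shifts surprisal. The numbers and the `surprisal_bits` helper are hypothetical and not taken from the paper.

```python
import math

def surprisal_bits(prob):
    """Surprisal in bits: the negative base-2 log of a probability."""
    return -math.log2(prob)

# Hypothetical probabilities for one word under two accounting schemes.
p_tokens_only = 0.6      # product of the word's own token probabilities
p_with_boundary = 0.3    # same product, times the chance the word ends here

shift = surprisal_bits(p_with_boundary) - surprisal_bits(p_tokens_only)
print(surprisal_bits(p_tokens_only), surprisal_bits(p_with_boundary), shift)
```

With these numbers, including the word boundary adds one bit of surprisal to the current word; under the uncorrected scheme that cost would surface on the following word instead.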

The confound arises from how word probabilities are assembled from subword probabilities. A word is first broken down into a sequence of subword tokens, and its probability is calculated as the product of the conditional probabilities of those tokens. Because the whitespace is attached to the start of the next word's first token, this product never includes the probability that the word stops after its last token, so it also counts any longer word that shares the same initial tokens.
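A toy example can make the sum-to-more-than-one result concrete. The next-token distributions below are invented, and `naive_word_prob` is a hypothetical helper that multiplies only the word's own token probabilities, with "Ġ" marking a leading space.

```python
import math

# Hypothetical next-token distributions, keyed by the token context.
NEXT = {
    ("Ġthe",): {"Ġcat": 0.6, "Ġdog": 0.4},
    ("Ġthe", "Ġcat"): {"s": 0.5, "Ġsat": 0.3, "Ġran": 0.2},
}

def naive_word_prob(context, word_tokens):
    """Chain-rule product over the word's own subword tokens only."""
    prob, ctx = 1.0, tuple(context)
    for tok in word_tokens:
        prob *= NEXT[ctx][tok]
        ctx += (tok,)
    return prob

# The three words that can follow "the" in this toy model.
candidates = {"cat": ["Ġcat"], "cats": ["Ġcat", "s"], "dog": ["Ġdog"]}
naive = {w: naive_word_prob(["Ġthe"], toks) for w, toks in candidates.items()}
total = sum(naive.values())
print(naive, total)
```

Here "cat" receives 0.6, which already includes the 0.3 assigned to "cats", so the three candidates total about 1.3 rather than 1.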

To address this issue, the authors present a simple decoding technique that reaccounts the probability of the trailing whitespace into that of the current word. The paper also connects the confound to two uses of word-by-word probabilities: evaluating language model predictions over minimal pairs and modeling the incremental processing difficulty of human readers.
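One way to read this kind of correction is sketched below: multiply the word's token probabilities by the probability that the next token starts a new word (has a leading whitespace), and divide by the same quantity for the preceding context, since that boundary has already been observed. This formulation, the toy distributions, and the helper names are assumptions for illustration, not the authors' code.

```python
import math

# Hypothetical next-token distributions; "Ġ" marks a leading space.
NEXT = {
    ("Ġthe",): {"Ġcat": 0.6, "Ġdog": 0.4},
    ("Ġthe", "Ġcat"): {"s": 0.5, "Ġsat": 0.3, "Ġran": 0.2},
    ("Ġthe", "Ġcat", "s"): {"Ġsat": 1.0},
    ("Ġthe", "Ġdog"): {"Ġsat": 1.0},
}

def boundary_prob(ctx):
    """Total probability that the next token begins a new word."""
    return sum(p for tok, p in NEXT[tuple(ctx)].items() if tok.startswith("Ġ"))

def corrected_word_prob(context, word_tokens):
    """Token product, with the trailing word boundary moved into the word."""
    prob, ctx = 1.0, tuple(context)
    start = boundary_prob(ctx)  # boundary before the word, already observed
    for tok in word_tokens:
        prob *= NEXT[ctx][tok]
        ctx += (tok,)
    return prob * boundary_prob(ctx) / start

candidates = {"cat": ["Ġcat"], "cats": ["Ġcat", "s"], "dog": ["Ġdog"]}
corrected = {w: corrected_word_prob(["Ġthe"], t) for w, t in candidates.items()}
print(corrected, sum(corrected.values()))
```

In this toy model the corrected probabilities of the three candidate words sum to 1, because the mass that continues "cat" into "cats" is no longer counted twice.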

Critical Analysis

The paper provides a thorough and well-designed investigation of an important issue in language model design and evaluation. The authors have clearly demonstrated the existence of the leading whitespace confound and its significant impact on word probability calculations, which is a crucial task for many real-world applications of language models.

One potential limitation is that the case study described in the abstract covers a single phenomenon, garden-path effects in transitive/intransitive sentences, so it is unclear how much the correction changes results for other constructions, models, or subword tokenization approaches. Additionally, the abstract frames the confound in terms of word-level probability estimates and does not address its impact on downstream tasks such as text generation or question answering.

That said, the authors have raised an important issue that deserves further attention from the natural language processing research community. The proposed decoding technique, which reassigns the probability of the trailing whitespace to the current word, is simple to state and warrants further investigation.

Overall, this paper makes a valuable contribution to the ongoing efforts to improve the robustness and reliability of language models. By addressing fundamental issues like the leading whitespace confound, researchers can work towards developing more accurate and trustworthy natural language processing systems that can be reliably deployed in real-world applications.

Conclusion

This paper has uncovered a significant confound in language models caused by the presence of leading whitespaces in their subword vocabularies. The authors have demonstrated that this issue can lead to substantial errors in the way language models calculate word probabilities, which is a critical task for many natural language processing applications.

By highlighting this problem and proposing potential solutions, the authors have made an important contribution to the field of language modeling. Addressing the leading whitespace confound could lead to more robust and reliable language models that can be more effectively deployed in real-world applications, such as text generation, translation, and question answering.

The insights and findings presented in this paper also have broader implications for the development of more transparent and trustworthy artificial intelligence systems. By identifying and addressing fundamental issues like this confound, researchers can work towards creating AI models that are more accurate, interpretable, and accountable, ultimately leading to more beneficial and responsible applications of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities

Byung-Doh Oh, William Schuler

Word-by-word conditional probabilities from Transformer-based language models are increasingly being used to evaluate their predictions over minimal pairs or to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the subword tokenization scheme of such language models, which has gone unaddressed thus far. This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in word probabilities that sum to more than one, thereby violating the axiom that $\mathsf{P}(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the current 'end of word' is incorrectly carried over to the next word. Additionally, language models' such implicit prediction of word boundaries is incongruous with psycholinguistic experiments where human subjects directly observe upcoming word boundaries. We present a simple decoding technique to reaccount the probability of the trailing whitespace into that of the current word, which resolves this confound. As a case study, we show that this results in significantly different estimates of garden-path effects in transitive/intransitive sentences, where a comma is strongly expected before the critical word.

6/18/2024

How to Compute the Probability of a Word

Tiago Pimentel, Clara Meister

Language models (LMs) estimate the probability distribution over sequences of natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research. While we are usually concerned with measuring these values for words, most LMs operate over subwords. Despite seemingly straightforward, accurately computing probabilities over one unit given probabilities over the other requires care. Indeed, we show here that many recent linguistic studies have been incorrectly computing these values. This paper derives the correct methods for computing word probabilities, highlighting issues when relying on language models that use beginning-of-word (bow)-marking tokenisers, e.g., the GPT family. Empirically, we show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.

6/21/2024

Tokenization Falling Short: The Curse of Tokenization

Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens, issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.

6/18/2024

Toward a Theory of Tokenization in LLMs

Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran

While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.

4/15/2024