Understanding and Mitigating Tokenization Bias in Language Models

Read original: arXiv:2406.16829 - Published 7/9/2024 by Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

Overview

This paper investigates tokenization bias in large language models (LLMs): a sampling bias that arises because the conditioning text must be encoded into subword tokens before the model predicts the next token.
The authors propose a framework for understanding and mitigating this bias, covering two popular encoding schemes, maximum prefix encoding (MPE) and byte-pair encoding (BPE).
The research builds on previous work that identifies tokenization as a potential source of bias in LLMs, including studies of the "curse of tokenization", selection bias, and token-level bias.

Plain English Explanation

Large language models (LLMs) are AI systems that can understand and generate human-like text. These models work by breaking down words into smaller pieces called "tokens" and then learning patterns from those tokens to predict what words and sentences should come next.

However, the way text is broken into tokens can introduce biases. Once a prompt has been converted into a fixed sequence of tokens, the model's next-token predictions can differ systematically from the predictions a model reading the raw characters would make. According to the paper's abstract, this "tokenization bias" cannot be removed with more training or more data.

The researchers in this paper developed a framework to better understand and address these tokenization biases. They looked at how popular encoding schemes cause the bias and proposed algorithms that correct for it without retraining the model.

By understanding the role of tokenization in introducing biases, the researchers hope to help improve the fairness and reliability of large language models, making them more robust and trustworthy for real-world applications.

Technical Explanation

The paper proposes a framework for understanding and mitigating tokenization bias in large language models (LLMs). The authors start from the observation that state-of-the-art language models are autoregressive and operate on subword tokens, so the conditioning string must be encoded into a list of tokens before next-token prediction. They show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data.

To verify this empirically, the researchers use a Markov-chain setup in which the true transition probabilities are known. Directly prompting the language model with tokens, the conventional method, does not recover those probabilities.

To counter the bias, the paper proposes, for each of the two encoding schemes, an algorithm that obtains unbiased estimates from any language model trained on tokenized data. The methods do not require finetuning the model, and in the case of MPE the complexity, defined as the number of model runs, scales linearly with the sequence length.

Together, these results indicate that token-free behavior can be simulated from a tokenized language model: in the Markov-chain experiments, the proposed method accurately recovers the transition probabilities.
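
To make the idea of tokenization bias concrete, the following Python sketch simulates a two-state Markov chain over the characters A and B and encodes the sampled text with a greedy longest-prefix encoder, used here as a stand-in for maximum prefix encoding. The vocabulary {AA, A, B}, the transition probabilities, and all function names are assumptions chosen for illustration, not details taken from the paper. The sketch compares what follows the character A in the raw text with what follows the token A in the encoded text.

```python
import random
from collections import Counter

# Hypothetical toy vocabulary (an assumption for illustration only).
VOCAB = ["AA", "A", "B"]


def mpe_encode(text, vocab=VOCAB):
    """Greedy longest-prefix encoding: always emit the longest matching token."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for tok in by_length:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


def sample_chain(n, p_stay_a=0.6, p_stay_b=0.7, seed=0):
    """Sample n characters from a first-order Markov chain over 'A' and 'B'.

    The transition probabilities are arbitrary illustrative values.
    """
    rng = random.Random(seed)
    state = "A"
    out = [state]
    for _ in range(n - 1):
        stay = p_stay_a if state == "A" else p_stay_b
        if rng.random() >= stay:
            state = "B" if state == "A" else "A"
        out.append(state)
    return "".join(out)


def next_counts(seq, context):
    """Count which item follows each occurrence of `context` in `seq`."""
    return Counter(nxt for prev, nxt in zip(seq, seq[1:]) if prev == context)


text = sample_chain(20000)
tokens = mpe_encode(text)

char_after_a = next_counts(text, "A")     # character-level statistics
token_after_a = next_counts(tokens, "A")  # token-level statistics

total_chars = sum(char_after_a.values())
print("P(next char = A | char A)  ~", char_after_a["A"] / total_chars)
print("tokens seen after token A  :", dict(token_after_a))
```

In this toy setup the token A is never followed by a token that starts with A, because the encoder would have emitted AA instead, even though the character A is frequently followed by another A. A model trained on such token sequences and prompted with the token A would inherit that skew, which is the kind of mismatch the paper's algorithms aim to correct.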

Critical Analysis

The paper provides a valuable contribution to the field by systematically investigating the role of tokenization in introducing biases into large language models. The experiments and analysis offer a nuanced understanding of how seemingly low-level preprocessing decisions can have significant consequences for model performance and fairness.

However, there are limitations and areas for further research. The empirical verification described in the abstract relies on a Markov-chain setup, so the proposed methods have yet to be fully validated at scale. There may also be other sources of bias beyond tokenization that the paper does not address.

Additionally, while the paper discusses the importance of mitigating tokenization biases, it does not delve deeply into the broader ethical and societal implications of these issues. As LLMs become more prevalent in high-stakes applications, it will be crucial to consider how tokenization biases could perpetuate or amplify societal inequities and to develop more comprehensive strategies for ensuring the responsible development and deployment of these technologies.

Conclusion

This paper offers a compelling analysis of tokenization bias in large language models, highlighting how seemingly technical choices in the preprocessing stage can have far-reaching consequences for model behavior and performance. By developing a framework for understanding and mitigating these biases, the researchers take an important step towards improving the fairness and reliability of LLMs.

As these models continue to advance and become more widely adopted, it will be crucial for the research community to build on this work and address the deeper ethical and societal implications of tokenization bias and other sources of model bias. Only by taking a holistic, multidisciplinary approach can we ensure that large language models are developed and deployed in a way that promotes equity, transparency, and responsible innovation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Understanding and Mitigating Tokenization Bias in Language Models

Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.

7/9/2024

Toward a Theory of Tokenization in LLMs

Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran

While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.

4/15/2024

Tokenization Falling Short: The Curse of Tokenization

Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens, issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.

6/18/2024

LAST: Language Model Aware Speech Tokenization

Arnon Turetzky, Yossi Adi

Speech tokenization serves as the foundation of speech language models (LMs), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Following such an approach may create a mismatch between the tokenization process and its usage afterward. In this study, we propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs. We advocate for the integration of this objective into the process of learning discrete speech representations. Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs. We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size. Our results demonstrate that the proposed tokenization method outperforms the evaluated baselines considering both spoken language modeling and speech-to-text. More importantly, unlike prior work, the proposed method allows the utilization of a single pre-trained LM for processing both speech and text inputs, setting it apart from conventional tokenization approaches.

9/11/2024