On the Effect of (Near) Duplicate Subwords in Language Modelling

2404.06508

Published 5/6/2024 by Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel

Abstract

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, however, this process may lead to less sample efficient LM training: as it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound to how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.

Overview

Tokenization is a core part of language models (LMs)
It involves splitting a character sequence into subwords and assigning them arbitrary indices
While typically lossless, this process can reduce the sample efficiency of LM training
The paper studies the impact of "near duplicate" subwords on LM training efficiency

Plain English Explanation

Tokenization is an important step in how language models (LMs) work. It involves breaking up a piece of text into smaller pieces, called "subwords", and assigning each subword a unique number. This is done before the text is used to train the LM.

Typically, this tokenization process doesn't lose any information: the original text can be recovered from the subwords. However, it could make it harder for LMs to learn efficiently. Because the model only sees each subword's index and not its spelling, character-level details are hidden from it, which could make it difficult to recognize similarities between related subwords, like "now" and "Now".

The researchers in this paper wanted to understand how much this loss of character-level information impacts LM training. They did this by:

Creating an experiment where each subword had a "duplicate" - an identical copy. This gave them an upper bound on how much LMs could improve if they could perfectly recognize these near-duplicate subwords.
Studying naturally occurring "near duplicate" subwords (rather than the artificial duplicates) by merging them and measuring how LM performance changes.

The key findings were:

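LMs need about 17% more data when every subword in the vocabulary is duplicated. Because these copies are perfectly equivalent, this figure is an upper bound on what perfect generalization across near duplicates could recover.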
However, when looking at naturally occurring near duplicates, merging them actually hurts LM performance.

This implies that while tokenization can reduce training efficiency, the natural near duplicates may not be as similar as expected, limiting the potential gains from better handling them.

Technical Explanation

The paper investigates the impact of "near duplicate" subwords on the training efficiency of language models (LMs). Tokenization, a core component of LMs, splits a character sequence into subwords and assigns them arbitrary indices before they are passed to the model.

Although tokenization is typically lossless, the researchers hypothesize that it may reduce the sample efficiency of LM training. By removing character-level information, it could make it harder for LMs to generalize across similar subwords, such as "now" and "Now".
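As a minimal illustration of why arbitrary indices hide character-level similarity, the sketch below uses a hypothetical five-entry vocabulary (not the paper's) and compares "now" and "Now" at the character level and at the index level. The helper names `one_hot` and `dot` are introduced here for illustration only.

```python
# Hypothetical toy vocabulary; the ids are simply positions in this list.
vocab = ["the", "now", "cat", "Now", "dog"]
token_to_id = {token: index for index, token in enumerate(vocab)}


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


now_vec = one_hot(token_to_id["now"], len(vocab))
Now_vec = one_hot(token_to_id["Now"], len(vocab))

# At the character level the two subwords differ only in case ...
same_up_to_case = "now".lower() == "Now".lower()
# ... but their index-based inputs share nothing.
input_overlap = dot(now_vec, Now_vec)

print(same_up_to_case, input_overlap)  # prints: True 0
```

Under this view, any relationship between the two subwords has to be learned from data rather than read off the input representation.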

To study this, the researchers first design an experiment that gives an upper bound on how much LMs could improve if they could perfectly generalize across near duplicate subwords. They do this by creating a fully duplicated vocabulary, where each subword has an identical copy. Experimentally, they find that LMs need roughly 17% more data when trained in this fully duplicated setting.
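The duplicated-vocabulary setup can be sketched roughly as follows. This is an assumption-laden toy version, not the authors' code: it assumes token `i` gets a copy with id `i + vocab_size`, and that each occurrence is swapped for its copy with a probability `p_copy` (a hypothetical parameter; the summary above only states that every subword has an identical copy).

```python
import random


def apply_duplication(token_ids, vocab_size, p_copy=0.5, rng=None):
    """Randomly replace each token id with its duplicate.

    Assumption: the duplicate of token i has id i + vocab_size, so the
    duplicated vocabulary has 2 * vocab_size entries.
    """
    if rng is None:
        rng = random.Random(0)
    return [t + vocab_size if rng.random() < p_copy else t for t in token_ids]


def canonical(token_ids, vocab_size):
    """Map every id, original or duplicate, back to its original id."""
    return [t % vocab_size for t in token_ids]


vocab_size = 4
sequence = [0, 3, 1, 3, 2]
duplicated = apply_duplication(sequence, vocab_size, rng=random.Random(0))

# The duplicated sequence carries the same information as the original:
# collapsing the copies recovers it exactly.
recovered = canonical(duplicated, vocab_size)
```

Because `canonical` always recovers the original sequence, the two members of each pair are perfectly equivalent, which is what makes this setting an upper bound: a model that fully generalized across each pair would lose nothing relative to the original vocabulary.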

Next, the researchers investigate the impact of naturally occurring near duplicate subwords on LM performance. Here, they find that merging these near duplicates considerably hurts LM performance. This suggests that while subword duplication negatively impacts LM training efficiency, the naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
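Merging near duplicates amounts to mapping several subword ids onto one shared id. The sketch below is a hedged illustration only: it assumes near duplicates are subwords that are identical after lowercasing (the "now"/"Now" case from the text), which may be narrower or broader than the criteria the paper uses, and `build_merge_map` is a hypothetical helper name.

```python
def build_merge_map(vocab):
    """Map each old id to a merged id shared by all case variants.

    Assumption: two subwords are near duplicates if they are equal after
    lowercasing. Returns (old_to_new, merged_vocab_size).
    """
    canonical_to_new_id = {}
    old_to_new = {}
    for old_id, subword in enumerate(vocab):
        key = subword.lower()
        if key not in canonical_to_new_id:
            canonical_to_new_id[key] = len(canonical_to_new_id)
        old_to_new[old_id] = canonical_to_new_id[key]
    return old_to_new, len(canonical_to_new_id)


# Hypothetical toy vocabulary.
vocab = ["now", "Now", "US", "us", "cat"]
old_to_new, merged_size = build_merge_map(vocab)

token_ids = [1, 4, 2, 3]  # "Now", "cat", "US", "us"
merged_ids = [old_to_new[t] for t in token_ids]
```

Unlike the artificial duplicates, this mapping is not reversible: after merging, "US" and "us" share an id and can no longer be told apart. That is one way to picture how merging can discard information; it is an illustration, not an explanation given in the paper.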

Critical Analysis

The paper provides a systematic analysis of how subword tokenization can impact the training efficiency of language models. The key experimental approach of creating a fully duplicated vocabulary to establish an upper bound on potential gains is novel and insightful.

However, the paper does not explore the reasons why naturally occurring near duplicates may not be as similar as expected. It would be valuable to understand the specific linguistic or contextual factors that contribute to this. Additionally, the paper does not discuss potential methods for better handling near duplicates, beyond simply merging them.

Further research could investigate more sophisticated techniques for identifying and leveraging near duplicate subwords, potentially drawing on work in systematic analysis of subwords for cross-lingual transfer, multi-word tokenization, or learning mutually informed representations of characters and subwords. Exploring the impacts on different types of language models and tasks, as well as across different languages, could also yield additional insights.

Conclusion

This paper provides a detailed analysis of the impact of "near duplicate" subwords on the training efficiency of language models. While the tokenization process used in LMs can reduce sample efficiency, the researchers find that naturally occurring near duplicates may not be as similar as expected, limiting the potential gains from better handling them.

The experimental approach of creating a fully duplicated vocabulary offers an insightful upper bound on possible improvements, but further research is needed to understand the factors contributing to the lack of similarity in natural near duplicates. Exploring more sophisticated techniques for identifying and leveraging these subwords could yield valuable improvements in LM training and performance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation

Francois Meyer, Jan Buys

Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.

4/1/2024

cs.CL

Toward a Theory of Tokenization in LLMs

Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran

While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.

4/15/2024

cs.CL cs.LG

Multi-word Tokenization for Sequence Compression

Leonidas Gee, Leonardo Rigutini, Marco Ernandes, Andrea Zugarini

Large Language Models have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) Increase in performance due to a greater coverage of input data given a fixed sequence length budget; (2) Faster and lighter inference due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust across shorter sequence lengths, thus allowing for major speedups via early sequence truncation.

4/8/2024

cs.CL cs.LG

Learning Mutually Informed Representations for Characters and Subwords

Yilin Wang, Xinyi Hu, Matthew R. Gormley

Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them output useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling (intraword code-switching). Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pre-trained models on all English sequence labeling tasks and classification tasks. We make our code publicly available.

4/9/2024

cs.CL cs.LG