Vocabulary Expansion for Low-resource Cross-lingual Transfer

Read original: arXiv:2406.11477 - Published 9/17/2024 by Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

Overview

Large language models (LLMs) have shown impressive capabilities in various languages beyond just English.
However, LLMs require more computational steps to generate non-English text due to their reliance on English-centric tokenizers, vocabularies, and pre-training data.
This results in higher usage costs for non-English speakers.
Vocabulary expansion with target language tokens is a widely used approach to address this issue, but previous work has focused on high-resource settings with abundant target language data.
This paper investigates sample-efficient adaptation strategies for vocabulary expansion in low-resource settings, where data and compute resources are limited.

Plain English Explanation

Large language models (LLMs) such as GPT-3 can understand and generate text in many different languages. However, their tokenizers, vocabularies, and training data are mostly English-centric, so they work less efficiently with non-English languages.

The main problem is that LLMs rely on English-centric tokenizers and vocabularies, so text in other languages is split into more tokens and takes more generation steps to produce. This makes the models slower and more expensive to use for people who don't speak English.

Vocabulary expansion is a common way to address this issue. The idea is to add tokens from the target language to the model's vocabulary, so that the same text can be represented with fewer tokens and generated in fewer steps. Previous research on this topic has focused on situations where there is a lot of data available in the target language.
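To make the token-count argument concrete, here is a minimal sketch in Python. It is a toy illustration, not the tokenizer used in the paper: the greedy longest-match function `tokenize`, the tiny vocabularies, and the Spanish phrase standing in for target-language text are all assumptions made for this example.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization.

    Characters not covered by any vocabulary entry fall back to
    single-character tokens (a stand-in for byte/character fallback).
    """
    tokens = []
    i = 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical English-centric vocabulary (illustrative only).
base_vocab = {"the", "cat", " ", "s", "a", "t"}

# Hypothetical target-language tokens added by vocabulary expansion.
expanded_vocab = base_vocab | {"gato", "duerme"}

text = "el gato duerme"
before = tokenize(text, base_vocab)
after = tokenize(text, expanded_vocab)

print(len(before), "tokens before expansion:", before)
print(len(after), "tokens after expansion:", after)
```

Because a generative model emits one token per step, the shorter token sequence after expansion corresponds to fewer generation steps for the same text.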

But what about languages that don't have as much data or computing power available? This paper explores strategies for adapting LLMs to work well in these "low-resource" settings, where data and compute are limited. The researchers test different ways of initializing the new vocabulary words and see how much target language data is needed for the model to perform well.

Technical Explanation

The researchers investigated sample-efficient adaptation strategies for vocabulary expansion of LLMs in low-resource settings. They explored factors like target vocabulary size, initialization methods, and the amount of target language data available for adaptation.

They conducted extensive experiments across a diverse set of languages, tasks, and LLMs. The results showed that simpler heuristic-based approaches to initializing the embeddings of the new tokens were more efficient and more robust to changes in target vocabulary size and adaptation data than more sophisticated methods that rely on external data and models.
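This summary does not say which heuristics were tested. As one illustration of what a simple heuristic can look like, the sketch below uses mean initialization: a new token's embedding is set to the average of the embeddings of the pieces the original tokenizer would have split it into. The function `mean_init`, the character-level stand-in tokenizer, and the 2-dimensional embeddings are hypothetical and chosen only to keep the example small.

```python
def mean_init(new_token, tokenize_fn, embeddings):
    """Return the average of the embeddings of the new token's old subtokens."""
    pieces = tokenize_fn(new_token)
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for piece in pieces:
        for k, value in enumerate(embeddings[piece]):
            total[k] += value
    return [component / len(pieces) for component in total]


# Hypothetical 2-dimensional embeddings for an existing character-level vocabulary.
embeddings = {
    "g": [1.0, 0.0],
    "a": [0.0, 1.0],
    "t": [1.0, 1.0],
    "o": [0.0, 0.0],
}

# Stand-in for the original tokenizer: split the string into characters.
char_tokenize = list

# Add the new token with an embedding derived from its old subtokens.
embeddings["gato"] = mean_init("gato", char_tokenize, embeddings)
print(embeddings["gato"])
```

A heuristic like this needs no external data or auxiliary model, which is the property the paragraph above contrasts with more sophisticated initialization methods.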

This suggests that in low-resource scenarios, it may be more practical to expand the vocabulary of LLMs with simpler techniques rather than relying on large amounts of target language data or complex external resources. The researchers' findings have implications for making LLMs more accessible and usable in a wider range of languages and contexts, especially where data and compute are limited.

Critical Analysis

The paper provides a valuable exploration of vocabulary expansion techniques for LLMs in low-resource settings. However, there are a few limitations and areas for further research that could be considered:

The experiments were conducted across a diverse set of languages, but the specific resource constraints and challenges of each language were not deeply analyzed. Further research could explore adaptation strategies tailored to the unique characteristics of different low-resource language families.
The paper focuses on vocabulary expansion, but other cross-lingual adaptation techniques like multilingual fine-tuning or prompting could also be investigated for low-resource settings.
The findings around the effectiveness of simpler heuristic-based initialization methods are interesting, but it would be valuable to further understand how vocabulary sharing and cross-lingual transfer influence the performance of these different initialization approaches.

Overall, this paper makes a valuable contribution to the field of multilingual LLM adaptation, especially in resource-constrained environments. The insights provided can help make these powerful models more accessible and usable across a wider range of languages and contexts.

Conclusion

This paper explores strategies for adapting large language models (LLMs) to work more efficiently with non-English languages, where data and compute resources may be limited. The researchers investigated vocabulary expansion techniques and found that simpler heuristic-based approaches to initializing new vocabulary words performed better than more sophisticated methods in low-resource settings.

These findings have important implications for making LLMs more accessible and usable across a wider range of languages, rather than being heavily biased towards English. By developing more sample-efficient adaptation strategies, LLMs can become powerful tools for communication and knowledge sharing in underserved linguistic communities. Further research is needed to build on these insights and continue pushing the boundaries of multilingual language modeling.

Related Papers

Vocabulary Expansion for Low-resource Cross-lingual Transfer

Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this paper, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, maintaining competitive downstream performance to baselines with only 30K sentences (~0.01GB text data) from the target language.

9/17/2024

An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models

Nandini Mundra, Aditya Nanda Kishore, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra

Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages. This problem is commonly tackled by continually pre-training and fine-tuning these models for said languages. A significant issue in this process is the limited vocabulary coverage in the original model's tokenizer, leading to inadequate representation of new languages and necessitating an expansion of the tokenizer. The initialization of the embeddings corresponding to new vocabulary items presents a further challenge. Current strategies require cross-lingual embeddings and lack a solid theoretical foundation as well as comparisons with strong baselines. In this paper, we first establish theoretically that initializing within the convex hull of existing embeddings is a good initialization, followed by a novel but simple approach, Constrained Word2Vec (CW2V), which does not require cross-lingual embeddings. Our study evaluates different initialization methods for expanding RoBERTa and LLaMA 2 across four languages and five tasks. The results show that CW2V performs equally well or even better than more advanced techniques. Additionally, simpler approaches like multivariate initialization perform on par with these advanced methods indicating that efficient large-scale multilingual continued pretraining can be achieved even with simpler initialization methods.

7/9/2024

An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference

Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English. This results in increased inference time and costs. Cross-lingual vocabulary adaptation (CVA) methods have been proposed for adapting models to a target language aiming to improve downstream performance. However, the effectiveness of these methods on increasing inference efficiency of generative LLMs has yet to be explored. In this paper, we perform an empirical study of five CVA methods on four generative LLMs (including monolingual and multilingual models) across four typologically-diverse languages and four natural language understanding tasks. We find that CVA substantially contributes to LLM inference speedups of up to 271.5%. We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.

6/18/2024

LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language

Cagri Toraman

Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.

5/14/2024