TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models

Read original: arXiv:2401.06620 - Published 5/24/2024 by Yihong Liu, Chunlan Ma, Haotian Ye, Hinrich Schutze

Overview

The paper proposes a contrastive learning framework called TransliCo to address the script barrier in multilingual pretrained language models.
The framework aims to enhance the cross-script transfer capabilities of these models, which can be limited due to the diversity of writing scripts across languages.
TransliCo leverages contrastive learning to learn script-invariant representations, improving the models' performance on cross-script tasks.

Plain English Explanation

The paper discusses a new approach called TransliCo that aims to help multilingual language models work better across different writing scripts. These models, which are trained on text from many languages, can sometimes struggle when dealing with languages that use very different scripts or writing systems, like the Latin alphabet versus the Cyrillic alphabet.

The key idea behind TransliCo is to use a technique called "contrastive learning" to train the models to focus on the meaning of the text rather than the specific script it's written in. This helps the models learn representations, or internal understandings, of the text that are more universal and transferable across different scripts.

The authors show that this approach can improve the models' performance on zero-shot crosslingual transfer, where a model trained on a task in one language is applied to other languages, including languages written in a different script. This is an important capability, as the diversity of writing systems used around the world can be a barrier to getting multilingual AI systems to work well across all languages.

Technical Explanation

The paper introduces a contrastive learning framework called TransliCo to address the "script barrier" in multilingual pretrained language models. This barrier arises due to the diversity of writing scripts across the world's languages, which can limit the cross-script transfer capabilities of these models.
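
To make the script barrier concrete, the sketch below compares the character overlap between the same word written in Serbian Cyrillic and in Croatian Latin script, before and after romanization. The table `CYR_TO_LAT` and both helper functions are illustrative assumptions that cover only a handful of letters; a real pipeline would use a dedicated transliteration tool rather than a hand-written mapping.

```python
# Toy illustration only: a tiny, hand-written Cyrillic-to-Latin table.
CYR_TO_LAT = {
    "в": "v", "о": "o", "д": "d", "а": "a",
    "м": "m", "л": "l", "е": "e", "к": "k",
}

def transliterate(text: str) -> str:
    """Map each known Cyrillic letter to Latin; leave other characters unchanged."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

def char_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

serbian = "вода"    # "water" in Serbian Cyrillic
croatian = "voda"   # "water" in Croatian (Latin script)

# Before transliteration the two spellings share no characters at all,
# even though they are the same word; afterwards they are identical.
before = char_overlap(serbian, croatian)
after = char_overlap(transliterate(serbian), croatian)
print(before, after)
```

With no shared characters, a subword tokenizer has no lexical overlap to exploit between the two spellings, which is the situation the script barrier describes.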

TransliCo leverages contrastive learning to learn script-invariant representations, aiming to improve the models' performance on cross-script tasks. The framework consists of three main components:

Transliteration: Sentences from the model's training data are transliterated into a unified script (Latin in the paper), giving each sentence a same-meaning counterpart in a common script.
Contrastive Objective: The Transliteration Contrastive Modeling (TCM) objective contrasts each sentence with its transliteration, encouraging representations that are similar for text with the same meaning but different scripts, and dissimilar for text with different meanings.
Finetuning: The pretrained language model is finetuned with the TCM objective on a small portion (5%) of its training data, rather than being pretrained from scratch.
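
The contrastive objective can be sketched as an InfoNCE-style loss over a batch of sentence embeddings and the embeddings of their same-meaning counterparts in another script: each sentence should be most similar to its own counterpart and less similar to the others in the batch. This is a minimal pure-Python sketch under stated assumptions. The function names, the temperature value, and the use of cosine similarity are illustrative choices, not details taken from the paper, and the embeddings are made-up numbers rather than model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(originals, counterparts, temperature=0.1):
    """InfoNCE-style loss: row i of `originals` is paired with row i of
    `counterparts`; all other rows in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(originals):
        logits = [cosine(anchor, c) / temperature for c in counterparts]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Negative log-probability assigned to the true pair.
        total += log_denominator - logits[i]
    return total / len(originals)

# Hypothetical 2-d embeddings for two sentences in their original script.
originals = [[1.0, 0.0], [0.0, 1.0]]
# Aligned case: each counterpart sits close to its own original.
aligned = [[0.9, 0.1], [0.1, 0.9]]
# Misaligned case: the pairs are swapped, so each true pair looks dissimilar.
misaligned = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(originals, aligned)
loss_misaligned = contrastive_loss(originals, misaligned)
print(loss_aligned < loss_misaligned)
```

Minimizing a loss of this shape pulls paired representations together and pushes unpaired ones apart, which is the behavior the contrastive objective is described as encouraging.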

The authors apply TransliCo to Glot500-m, a multilingual model pretrained on over 500 languages, and call the resulting model Furina. They report that Furina aligns representations from different scripts better than the original Glot500-m and outperforms it on various zero-shot crosslingual transfer tasks, with consistent improvements in a case study on the Indic language group, where the languages share areal features but use different scripts.
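
One common way to probe how well a model handles text across scripts is sentence retrieval: embed parallel sentences written in two scripts and check how often the nearest neighbor of a sentence is its true counterpart. The sketch below is an illustration of that idea only, not the paper's evaluation code; the function name `top1_retrieval_accuracy` and all embedding values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_retrieval_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate has the same index."""
    correct = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, cand) for cand in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        correct += int(best == i)
    return correct / len(queries)

# Hypothetical embeddings: three sentences in script A ...
script_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# ... their parallel sentences in script B in a well-aligned space ...
script_b_aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
# ... and in a poorly aligned space where the first two are confused.
script_b_shifted = [[0.1, 0.9, 0.0], [0.2, 0.7, 0.1], [0.0, 0.2, 0.9]]

acc_aligned = top1_retrieval_accuracy(script_a, script_b_aligned)
acc_shifted = top1_retrieval_accuracy(script_a, script_b_shifted)
print(acc_aligned, acc_shifted)
```

Higher retrieval accuracy between scripts suggests that sentences with the same meaning land near each other regardless of how they are written.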

Critical Analysis

The paper provides a novel and compelling approach to enhancing the cross-script capabilities of multilingual language models. The authors acknowledge that the script diversity across languages can be a significant challenge for these models, and their TransliCo framework offers a principled solution to this problem.

One potential limitation of the work is that it relies on transliterating the training data into a single script. Transliteration can discard distinctions that the original script encodes, and its quality may vary across scripts and languages, which could limit the applicability of the framework in some low-resource settings.

Additionally, it would be valuable to understand how sensitive the framework is to the quality of the transliterations, and whether there are ways to make it more robust to transliteration errors.

Nevertheless, the paper represents an important contribution to the field of multilingual language modeling, and the TransliCo framework could have significant implications for improving the cross-lingual representation alignment capabilities of these models across diverse writing scripts.

Conclusion

The TransliCo framework proposed in this paper offers a novel approach to addressing the script barrier in multilingual pretrained language models. By contrasting sentences with their Latin transliterations to learn script-invariant representations, the framework improves on the original Glot500-m in zero-shot crosslingual transfer, which is an important capability for deploying these models in real-world multilingual settings.

The paper demonstrates the effectiveness of the TransliCo approach through extensive experiments, and the framework could have broader implications for enhancing the cross-lingual transfer abilities of multilingual language models more generally. As the field of multilingual AI continues to advance, addressing challenges like the script barrier will be crucial for developing robust and universally applicable language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models

Yihong Liu, Chunlan Ma, Haotian Ye, Hinrich Schutze

The world's more than 7000 languages are written in at least 293 scripts. Due to various reasons, many closely related languages use different scripts, which poses a difficulty for multilingual pretrained language models (mPLMs) in learning crosslingual knowledge through lexical overlap. As a consequence, mPLMs are faced with a script barrier: representations from different scripts are located in different subspaces, which can result in crosslingual transfer involving languages of different scripts performing suboptimally. To address this problem, we propose TransliCo, a framework that optimizes the Transliteration Contrastive Modeling (TCM) objective to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script (in our case Latin), which enhances uniformity in the representation space for different scripts. Using Glot500-m, an mPLM pretrained on over 500 languages, as our source model, we fine-tune it on a small portion (5%) of its training data, and refer to the resulting model as Furina. We show that Furina not only better aligns representations from distinct scripts but also outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks. Additionally, we achieve consistent improvement in a case study on the Indic group where the languages exhibit areal features but use different scripts. We make our code and models publicly available.

5/24/2024

TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data

Yihong Liu, Chunlan Ma, Haotian Ye, Hinrich Schutze

Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is not desired because it takes a lot of computation budget for pretraining. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanied tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks. We make our code and models publicly available at https://github.com/cisnlp/TransMI.

5/17/2024

TaCo: Enhancing Cross-Lingual Transfer for Low-Resource Languages in LLMs through Translation-Assisted Chain-of-Thought Processes

Bibek Upadhayay, Vahid Behzadan

Creating multilingual LLMs poses a significant challenge. Pretraining or fine-tuning LLMs to adopt new languages is evidently very costly. Furthermore, there exist limitations concerning benchmark datasets and the metrics used to measure model performance in multilingual settings. This paper proposes cost-effective solutions to both aforementioned challenges. Firstly, we introduce the Multilingual Instruction-Tuning Dataset (MITS), comprised of Alpaca-52K, Dolly-15K, and Vicuna Benchmark translations into 132 languages. Secondly, we propose a new method called TaCo: Translation-Assisted Cross-Linguality, which utilizes translations in a chain-of-thought process to instruction-tune LLMs on new languages through a curriculum-learning process. As a proof of concept, we experimented with the instruction-tuned Guanaco-33B model, performing further instruction tuning using our proposed TaCo method in three low-resource languages and one high-resource language. Our results indicate that the TaCo method impresses GPT-4 with an 82% score for a low-resource language in the Vicuna Benchmark dataset, doubling the performance in contrast to instruction tuning alone. Furthermore, TaCo shows promise in creating multilingual LLMs, even for low-resource languages. We have released our datasets and model adapters (https://github.com/UNHSAILLab/TaCo), encouraging the research community to utilize these resources to advance work on multilingual LLMs.

4/8/2024

Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

Orgest Xhelili, Yihong Liu, Hinrich Schutze

Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, Mediterranean-Amharic-Farsi and South+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements. We will make our code and models publicly available at https://github.com/cisnlp/Transliteration-PPA.

7/1/2024