Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation

arXiv:2401.06568

Published 6/7/2024 by Xu Huang, Zhirui Zhang, Xiang Geng, Yichao Du, Jiajun Chen, Shujian Huang

Abstract

This study investigates how Large Language Models (LLMs) leverage source and reference data in the machine translation evaluation task, aiming to better understand the mechanisms behind their remarkable performance in this task. We design controlled experiments across various input modes and model types, and employ both coarse-grained and fine-grained prompts to discern the utility of source versus reference information. We find that reference information significantly enhances the evaluation accuracy, while surprisingly, source information is sometimes counterproductive, indicating LLMs' inability to fully leverage their cross-lingual capability when evaluating translations. Further analysis of the fine-grained evaluation and fine-tuning experiments shows similar results. These findings also suggest a potential research direction for LLMs that fully exploits the cross-lingual capability of LLMs to achieve better performance in machine translation evaluation tasks.

Overview

This study investigates how Large Language Models (LLMs) leverage source and reference data in machine translation evaluation tasks.
The researchers designed controlled experiments to understand the mechanisms behind LLMs' remarkable performance in this task.
They explored the utility of source versus reference information using both coarse-grained and fine-grained prompts.

Plain English Explanation

The researchers wanted to understand how large language models are able to evaluate the quality of machine translations so well. They conducted a series of experiments to see how much the original text (source) and the reference translation (reference) influenced the model's assessment.

They found that the reference translation substantially improved the model's evaluation accuracy. Surprisingly, including the original text sometimes made the evaluation worse, which suggests that the model cannot fully use its cross-lingual capabilities when judging translations.

The researchers also did more detailed analyses to confirm these findings. Their results point to a potential research direction for improving how LLMs can use their cross-lingual skills to better evaluate machine translations.

Technical Explanation

The researchers designed controlled experiments to investigate how large language models leverage source and reference data in machine translation evaluation tasks. They explored various input modes and model types, using both coarse-grained and fine-grained prompts.
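The idea of varying the input mode can be illustrated with a minimal sketch. It assumes three modes (source only, reference only, and both) and a coarse-grained prompt that asks for a single score. The mode names, the `build_prompt` helper, and the prompt wording are hypothetical illustrations, not the prompts used in the paper.

```python
from typing import Optional

# Hypothetical input modes: (include source?, include reference?).
MODES = {
    "source_only": (True, False),
    "reference_only": (False, True),
    "source_and_reference": (True, True),
}


def build_prompt(
    mode: str,
    translation: str,
    source: Optional[str] = None,
    reference: Optional[str] = None,
) -> str:
    """Assemble a coarse-grained scoring prompt for one input mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    use_source, use_reference = MODES[mode]
    if use_source and source is None:
        raise ValueError("this mode needs a source sentence")
    if use_reference and reference is None:
        raise ValueError("this mode needs a reference translation")

    # Illustrative wording only; the paper's actual prompt text is not reproduced here.
    lines = ["Score the following translation on a scale from 0 to 100."]
    if use_source:
        lines.append(f"Source: {source}")
    if use_reference:
        lines.append(f"Reference: {reference}")
    lines.append(f"Translation: {translation}")
    lines.append("Score:")
    return "\n".join(lines)


if __name__ == "__main__":
    for mode in MODES:
        print(f"--- {mode} ---")
        print(build_prompt(mode, "The cat sleeps.",
                           source="Die Katze schläft.",
                           reference="The cat is sleeping."))
```

Holding the translation fixed and changing only the context lines is what makes such a comparison controlled: any difference in the scores can then be attributed to the presence or absence of the source or reference.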

The key finding was that reference information significantly enhanced the evaluation accuracy, while source information was sometimes counterproductive. This indicates that LLMs struggle to fully utilize their cross-lingual capabilities when evaluating translations.
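Evaluation accuracy in this setting is typically measured as agreement between the model's scores and human judgments. The sketch below shows one common way to compute such agreement, pairwise ranking accuracy. This summary does not state which metric the paper uses, so the choice of metric and the `pairwise_accuracy` helper are assumptions for illustration.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(metric_scores: Sequence[float],
                      human_scores: Sequence[float]) -> float:
    """Fraction of item pairs that the metric ranks in the same order as humans.

    Pairs that humans score equally are skipped; a metric tie on a pair
    that humans distinguish counts as a disagreement.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        metric_diff = metric_scores[i] - metric_scores[j]
        total += 1
        if human_diff * metric_diff > 0:
            agree += 1

    if total == 0:
        raise ValueError("no pairs with a human preference")
    return agree / total
```

Computing this number separately for each input mode would allow a direct comparison of, for example, reference-only prompts against prompts that also include the source.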

Further analysis, including fine-grained evaluation and fine-tuning experiments, produced similar results. These findings suggest a potential research direction for leveraging LLMs' cross-lingual skills to achieve better performance in machine translation evaluation tasks.

Critical Analysis

The paper provides valuable insights into the mechanisms underlying LLMs' performance in machine translation evaluation. However, the researchers acknowledge that their findings are limited to the specific experimental setup and datasets used.

It would be interesting to see if the results hold true for a wider range of language pairs, translation quality levels, and model architectures. Additionally, the researchers did not explore potential reasons why source information sometimes proved counterproductive, which could be an area for further investigation.

Overall, the study offers a thoughtful analysis and raises important questions about how LLMs can better leverage cross-lingual capabilities for translation evaluation. Continued research in this direction could lead to significant advancements in this field.

Conclusion

This study sheds light on how Large Language Models utilize source and reference information in machine translation evaluation tasks. The key finding is that reference data is crucial for the models' performance, while source information can sometimes be counterproductive.

These insights suggest that there is room for improvement in how LLMs leverage their cross-lingual abilities for translation assessment. Further research in this area could lead to significant advancements in machine translation evaluation and, ultimately, improve the quality of machine-translated content for a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies; consult the original source documents.

Related Papers

Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages

Vandan Mujadia, Pruthwik Mishra, Arafat Ahsan, Dipti Misra Sharma

With the primary focus on evaluating the effectiveness of large language models for automatic reference-less translation assessment, this work presents our experiments on mimicking human direct assessment to evaluate the quality of translations in English and Indian languages. We constructed a translation evaluation task where we performed zero-shot learning, in-context example-driven learning, and fine-tuning of large language models to provide a score out of 100, where 100 represents a perfect translation and 1 represents a poor translation. We compared the performance of our trained systems with existing methods such as COMET, BERT-Scorer, and LABSE, and found that the LLM-based evaluator (LLaMA-2-13B) achieves a comparable or higher overall correlation with human judgments for the considered Indian language pairs.

4/4/2024

cs.CL

How Multilingual Are Large Language Models Fine-Tuned for Translation?

Aquia Richburg, Marine Carpuat

A new paradigm for machine translation has recently emerged: fine-tuning large language models (LLM) on parallel text has been shown to outperform dedicated translation systems trained in a supervised fashion on much larger amounts of parallel data (Xu et al., 2024a; Alves et al., 2024). However, it remains unclear whether this paradigm can enable massively multilingual machine translation or whether it requires fine-tuning dedicated models for a small number of language pairs. How does translation fine-tuning impact the MT capabilities of LLMs for zero-shot languages, zero-shot language pairs, and translation tasks that do not involve English? To address these questions, we conduct an extensive empirical evaluation of the translation quality of the TOWER family of language models (Alves et al., 2024) on 132 translation tasks from the multi-parallel FLORES-200 data. We find that translation fine-tuning improves translation quality even for zero-shot languages on average, but that the impact is uneven depending on the language pairs involved. These results call for further research to effectively enable massively multilingual translation with LLMs.

6/3/2024

cs.CL cs.LG

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

Sang T. Truong, Duc Q. Nguyen, Toan Nguyen, Dong D. Le, Nhi N. Truong, Tho Quan, Sanmi Koyejo

Recent advancements in large language models (LLMs) have underscored their importance in the evolution of artificial intelligence. However, despite extensive pretraining on multilingual datasets, available open-sourced LLMs exhibit limited effectiveness in processing Vietnamese. The challenge is exacerbated by the absence of systematic benchmark datasets and metrics tailored for Vietnamese LLM evaluation. To mitigate these issues, we have finetuned LLMs specifically for Vietnamese and developed a comprehensive evaluation framework encompassing 10 common tasks and 31 metrics. Our evaluation results reveal that the fine-tuned LLMs exhibit enhanced comprehension and generative capabilities in Vietnamese. Moreover, our analysis indicates that models with more parameters can introduce more biases and uncalibrated outputs and the key factor influencing LLM performance is the quality of the training or fine-tuning datasets. These insights underscore the significance of meticulous fine-tuning with high-quality datasets in enhancing LLM performance.

5/28/2024

cs.CL cs.AI

Analysis of Multi-Source Language Training in Cross-Lingual Transfer

Seong Hoon Lim, Taejun Yun, Jinhyeon Kim, Jihun Choi, Taeuk Kim

The successful adaptation of multilingual language models (LMs) to a specific language-task pair critically depends on the availability of data tailored for that condition. While cross-lingual transfer (XLT) methods have contributed to addressing this data scarcity problem, there is still ongoing debate about the mechanisms behind their effectiveness. In this work, we focus on one of the promising assumptions about the inner workings of XLT, that it encourages multilingual LMs to place greater emphasis on language-agnostic or task-specific features. We test this hypothesis by examining how the patterns of XLT change with a varying number of source languages involved in the process. Our experimental findings show that the use of multiple source languages in XLT, a technique we term Multi-Source Language Training (MSLT), leads to increased mingling of embedding spaces for different languages, supporting the claim that XLT benefits from making use of language-independent information. On the other hand, we discover that using an arbitrary combination of source languages does not always guarantee better performance. We suggest simple heuristics for identifying effective language combinations for MSLT and empirically prove their effectiveness.

6/6/2024

cs.CL