AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Read original: arXiv:2311.09828 - Published 4/12/2024 by Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal, Marek Masiak, Ricardo Rei, Eleftheria Briakou, Marine Carpuat, Xuanli He, Sofia Bourhim, Andiswa Bukula and 48 others

Overview

The paper addresses the challenge of accurately measuring progress in multilingual machine translation (MT) for under-resourced African languages.
Traditional n-gram matching metrics like BLEU tend to correlate weakly with human judgments, while learned metrics like COMET are held back by the scarcity of human-rated evaluation data and the limited language coverage of multilingual encoders.
The authors create high-quality human evaluation data with simplified Multidimensional Quality Metrics (MQM) guidelines and direct assessment (DA) scoring for 13 African languages.
They develop AfriCOMET, a family of COMET-based evaluation metrics for African languages, which achieves a state-of-the-art Spearman-rank correlation with human judgments (0.441).

Plain English Explanation

Measuring the performance of machine translation (MT) systems for African languages has been challenging. Traditional evaluation methods, like comparing the machine-translated text to a reference translation using a metric called BLEU, often don't align well with how humans actually judge the quality of the translations.

More advanced evaluation approaches, like COMET, have shown higher correlation with human judgments. However, these methods have been limited by the lack of high-quality human evaluation data for under-resourced African languages. Additionally, the complexity of the guidelines used for this type of human evaluation, and the limited language coverage of the multilingual models underlying the COMET approach, have made it difficult to apply these techniques to African languages.

To address these issues, the researchers in this paper created a new dataset of high-quality human evaluations for 13 diverse African languages. They used simplified guidelines for detecting translation errors and for directly assessing the overall quality of the translations. Building on this dataset, they developed a new evaluation metric called AfriCOMET, which is specifically tailored to evaluating MT systems for African languages. AfriCOMET correlates more closely with human judgments than the other evaluation approaches it was compared against.

Technical Explanation

The paper begins by highlighting the challenges in accurately measuring progress in multilingual machine translation (MT) for under-resourced African languages. Traditional evaluation metrics like BLEU, which rely on n-gram matching, often show weaker correlation with human judgments of translation quality.
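To make the n-gram matching idea concrete, here is a minimal Python sketch of clipped n-gram precision, the building block of BLEU. It is an illustration only, not the implementation used in the paper: the function names and the example sentences are assumptions, tokenization is plain whitespace splitting, and full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference.

    Each hypothesis n-gram is credited at most as many times as it
    occurs in the reference.
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Hypothetical example: a reasonable paraphrase that shares few surface words.
reference = "the cat sat on the mat"
paraphrase = "a feline was sitting on the rug"

unigram = clipped_precision(paraphrase, reference, 1)  # 2 of 7 unigrams match
bigram = clipped_precision(paraphrase, reference, 2)   # 1 of 6 bigrams match
print(f"unigram precision: {unigram:.3f}, bigram precision: {bigram:.3f}")
```

The paraphrase scores low even though a human reader might accept it, which is one intuition for why surface-overlap metrics can disagree with human judgments.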

To address this, the authors create a high-quality human evaluation dataset for 13 diverse African languages. They use simplified Multidimensional Quality Metrics (MQM) guidelines for error detection and direct assessment (DA) scoring to collect human ratings of translation quality. This dataset helps overcome the lack of evaluation data with human ratings for under-resourced African languages.
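As a rough picture of what one human rating might look like, the sketch below models a segment annotated with marked error spans and a DA score. The schema is hypothetical: the class names, the field names, the 0-100 score range, and the choice to average annotators are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ErrorSpan:
    start: int  # character offset where the marked error begins
    end: int    # character offset just past the marked error


@dataclass
class Annotation:
    annotator_id: str
    da_score: float  # direct assessment score, assumed here to be on a 0-100 scale
    error_spans: list[ErrorSpan] = field(default_factory=list)


def segment_score(annotations):
    """Average the DA scores given by all annotators of one segment (an assumed aggregation)."""
    return mean(a.da_score for a in annotations)


# Hypothetical ratings of a single translated segment by two annotators.
annotations = [
    Annotation("a1", 80.0, [ErrorSpan(4, 9)]),
    Annotation("a2", 70.0),
]
score = segment_score(annotations)
print(score)
```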

Building on this dataset, the researchers develop AfriCOMET, a COMET-based evaluation metric for African languages. AfriCOMET leverages direct assessment data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R). The authors report a Spearman-rank correlation of 0.441 with human judgments, which they describe as state of the art among MT evaluation metrics for African languages.
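Spearman-rank correlation measures how well the ordering of metric scores agrees with the ordering of human scores. The sketch below computes it from scratch as the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The scores are made-up illustrative numbers, not data from the paper, and the function names are assumptions.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman-rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


# Made-up scores for five translated segments.
human_da = [85, 40, 70, 20, 95]
metric = [0.71, 0.35, 0.52, 0.41, 0.88]
rho = spearman(metric, human_da)
print(f"Spearman rho = {rho:.3f}")
```

Because only the ranks matter, a metric does not need to be on the same scale as the human scores; it only needs to order the segments in a similar way.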

Critical Analysis

The paper presents a valuable contribution in addressing the challenges of evaluating machine translation progress for under-resourced African languages. By creating a high-quality human evaluation dataset and developing the AfriCOMET metric, the researchers have taken important steps towards more accurate and reliable evaluation of MT systems for these languages.

However, the paper does not delve into the specific limitations of the human evaluation data collection process, such as potential biases or inconsistencies introduced by the simplified MQM guidelines. AfriCOMET could also be examined further in related settings, for example retrieval-augmented approaches to low-resource machine translation or reference-free evaluation driven by large language models.

It would also be interesting to see how the AfriCOMET metric performs across different domains and text genres, as well as its applicability to other under-resourced language families beyond Africa.

Conclusion

This paper addresses a critical challenge in accurately measuring progress in multilingual machine translation for under-resourced African languages. By creating a high-quality human evaluation dataset and developing the AfriCOMET evaluation metric, the researchers have made a significant contribution to the field.

The AfriCOMET metric's strong correlation with human judgments suggests it could be a valuable tool for evaluating and improving MT systems for African languages. This work paves the way for more reliable and impactful machine translation research and development that can benefit underserved language communities across the African continent.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal, Marek Masiak, Ricardo Rei, Eleftheria Briakou, Marine Carpuat, Xuanli He, Sofia Bourhim, Andiswa Bukula, Muhidin Mohamed, Temitayo Olatoye, Tosin Adewumi, Hamam Mokayede, Christine Mwase, Wangui Kimotho, Foutse Yuehgoh, Anuoluwapo Aremu, Jessica Ojo, Shamsuddeen Hassan Muhammad, Salomey Osei, Abdul-Hakeem Omotayo, Chiamaka Chukwuneke, Perez Ogayo, Oumaima Hourrane, Salma El Anigri, Lolwethu Ndolela, Thabiso Mangwana, Shafie Abdi Mohamed, Ayinde Hassan, Oluwabusayo Olufunke Awoyomi, Lama Alkhaled, Sana Al-Azzawi, Naome A. Etori, Millicent Ochieng, Clemencia Siro, Samuel Njoroge, Eric Muchiri, Wangari Kimotho, Lyse Naomi Wamba Momo, Daud Abolade, Simbiat Ajao, Iyanuoluwa Shode, Ricky Macharm, Ruqayya Nasir Iro, Saheed S. Abdullahi, Stephen E. Moore, Bernard Opoku, Zainab Akinjobi, Abeeb Afolabi, Nnaemeka Obiefuna, Onyekachi Raphael Ogbu, Sam Brian, Verrah Akinyi Otiende, Chinedu Emmanuel Mbonu, Sakayo Toadoum Sari, Yao Lu, Pontus Stenetorp

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).

4/12/2024

Pitfalls and Outlooks in Using COMET

Vilém Zouhar, Pinzhen Chen, Tsz Kin Lam, Nikita Moghe, Barry Haddow

Since its introduction, the COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores are not comparable between papers or even technical setups and we put forward our perspective on fixing each issue. Furthermore, we release the SacreCOMET package that can generate a signature for the software and model configuration as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.

9/4/2024

xCOMET-lite: Bridging the Gap Between Efficiency and Quality in Learned MT Evaluation Metrics

Daniil Larionov, Mikhail Seleznyov, Vasiliy Viskov, Alexander Panchenko, Steffen Eger

State-of-the-art trainable machine translation evaluation metrics like xCOMET achieve high correlation with human judgment but rely on large encoders (up to 10.7B parameters), making them computationally expensive and inaccessible to researchers with limited resources. To address this issue, we investigate whether the knowledge stored in these large encoders can be compressed while maintaining quality. We employ distillation, quantization, and pruning techniques to create efficient xCOMET alternatives and introduce a novel data collection pipeline for efficient black-box distillation. Our experiments show that, using quantization, xCOMET can be compressed up to three times with no quality degradation. Additionally, through distillation, we create an xCOMET-lite metric, which has only 2.6% of xCOMET-XXL parameters, but retains 92.1% of its quality. Besides, it surpasses strong small-scale metrics like COMET-22 and BLEURT-20 on the WMT22 metrics challenge dataset by 6.4%, despite using 50% fewer parameters. All code, dataset, and models are available online.

6/21/2024

Chasing COMET: Leveraging Minimum Bayes Risk Decoding for Self-Improving Machine Translation

Kamil Guttmann, Mikołaj Pokrywka, Adrian Charkiewicz, Artur Nowakowski

This paper explores Minimum Bayes Risk (MBR) decoding for self-improvement in machine translation (MT), particularly for domain adaptation and low-resource languages. We implement the self-improvement process by fine-tuning the model on its MBR-decoded forward translations. By employing COMET as the MBR utility metric, we aim to achieve the reranking of translations that better aligns with human preferences. The paper explores the iterative application of this approach and the potential need for language-specific MBR utility metrics. The results demonstrate significant enhancements in translation quality for all examined language pairs, including successful application to domain-adapted models and generalisation to low-resource settings. This highlights the potential of COMET-guided MBR for efficient MT self-improvement in various scenarios.

5/21/2024