MELA: Multilingual Evaluation of Linguistic Acceptability

Read original: arXiv:2311.09033 - Published 6/7/2024 by Ziyin Zhang, Yikang Liu, Weifang Huang, Junyu Mao, Rui Wang, Hai Hu

Overview

Researchers present the largest benchmark to date on linguistic acceptability, called Multilingual Evaluation of Linguistic Acceptability (MELA), covering 10 languages from diverse language families.
They establish baseline performance of large language models (LLMs) on this benchmark and investigate cross-lingual transfer in acceptability judgments using XLM-R.
The researchers also conduct probing experiments to explore how syntax capability is acquired by fine-tuning XLM-R on the MELA dataset.

Plain English Explanation

The researchers have created a new benchmark, called MELA, to test how well AI language models can judge the grammatical correctness of sentences in 10 different languages. This is the largest such benchmark to date, covering a diverse set of languages.

The researchers tested several language models on this benchmark, including GPT-4o, open-source multilingual models, and a fine-tuned XLM-R. GPT-4o exhibited strong multilingual abilities and outperformed the fine-tuned XLM-R, while the open-source multilingual models lagged behind by a noticeable gap.

The researchers also looked at how well the language models can transfer their knowledge from one language to another. For example, they found that fine-tuning XLM-R on just 500 examples in Icelandic yielded a score of 23 MCC (Matthews correlation coefficient) on a completely unrelated language, Chinese. This suggests that the models are learning some general principles about language structure that can be applied across languages.

To better understand how the models acquire these language skills, the researchers probed the inner workings of the fine-tuned XLM-R model. They found that training on the MELA dataset improved the model's performance on tasks related to syntax, indicating that the model is learning important grammatical concepts through this type of training.

The researchers have made the MELA dataset publicly available for other researchers to use and build upon. This could lead to further advancements in the development of multilingual language models that can better understand and generate grammatically correct text across a wide range of languages.

Technical Explanation

The researchers present the Multilingual Evaluation of Linguistic Acceptability (MELA) benchmark, which, with 46K samples covering 10 languages from diverse language families, is the largest to date for assessing linguistic acceptability. They establish LLM baselines on this benchmark, including GPT-4o and open-source multilingual models, and compare them against a fine-tuned XLM-R.

To investigate cross-lingual transfer in acceptability judgments, the researchers conduct experiments using the XLM-R model. They find that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples lead to a performance of 23 MCC on a completely unrelated language, Chinese.
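The paper's abstract reports this transfer result as an MCC score, that is, the Matthews correlation coefficient commonly used for acceptability benchmarks. As a rough illustration of the metric, and not the authors' evaluation code, here is a minimal pure-Python sketch for binary labels. The function name and the toy labels are hypothetical, and reading "23 MCC" as a coefficient of 0.23 scaled by 100 is an assumption.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Labels are assumed to be 1 (acceptable) or 0 (unacceptable).
    Returns a value in [-1, 1]; 0 means no better than chance.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy gold labels and predictions (made up for illustration only).
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 1, 1, 0, 1]
always_acceptable = [1] * len(gold)

print(round(matthews_corrcoef(gold, pred), 3))        # 7/15, prints 0.467
print(matthews_corrcoef(gold, always_acceptable))     # prints 0.0
```

In this toy case, a model that labels every sentence acceptable reaches 62.5% accuracy but an MCC of 0, which illustrates why MCC is often preferred over accuracy when the two classes are imbalanced.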

In pursuit of multilingual interpretability, the researchers perform probing experiments on XLM-R fine-tuned on the MELA dataset. Their results indicate that this training improves the model's performance on syntax-related tasks, suggesting that the model acquires grammatical knowledge through acceptability training.
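This summary does not describe the probing setup in detail. One common form of probing, shown here only as a sketch under that assumption, trains a small linear classifier on frozen model representations and compares how well a syntactic label can be read off before and after fine-tuning. The feature vectors below are synthetic stand-ins, not XLM-R outputs, and all names are hypothetical.

```python
import math


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with plain SGD on fixed feature vectors."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # d(loss)/dz for log loss
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples where the probe's prediction matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        predicted = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        correct += int(predicted == y)
    return correct / len(labels)


# Synthetic "frozen representations" (assumption: stand-ins for encoder outputs).
train_labels = [1, 1, 1, 0, 0, 0]
test_labels = [1, 0]

# Hypothetical base encoder: vectors carry no information about the label.
base_train = [[0.5, 0.1], [-0.3, 0.4], [0.2, -0.6],
              [0.5, 0.1], [-0.3, 0.4], [0.2, -0.6]]
base_test = [[0.1, 0.3], [0.1, 0.3]]

# Hypothetical fine-tuned encoder: the first dimension separates the classes.
tuned_train = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.1],
               [-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.0]]
tuned_test = [[0.9, 0.0], [-0.8, 0.1]]

base_acc = probe_accuracy(*train_linear_probe(base_train, train_labels),
                          base_test, test_labels)
tuned_acc = probe_accuracy(*train_linear_probe(tuned_train, train_labels),
                           tuned_test, test_labels)
print("base probe accuracy:", base_acc)
print("fine-tuned probe accuracy:", tuned_acc)
```

The probe is deliberately weak (a single linear layer), so a higher probe accuracy is usually read as evidence that the information is already encoded in the representations and is not being learned by the probe itself.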

The researchers' findings show that GPT-4o exhibits strong multilingual abilities, outperforming the fine-tuned XLM-R model, while open-source multilingual models lag behind by a noticeable gap. This highlights the importance of continued research and development in scaling the multilingual capabilities of LLMs.

The MELA dataset is publicly available at https://github.com/sjtu-compling/MELA, and the researchers' work contributes to the growing body of research on evaluating the multilingual capabilities of LLMs.

Critical Analysis

The researchers acknowledge that their experiments only cover a small set of languages, and more work is needed to fully understand the cross-lingual transfer capabilities of language models. Additionally, a broader benchmark such as MEGAVERSE, which comprises 22 datasets covering 83 languages, could provide a more comprehensive evaluation of multilingual performance across a wider range of tasks and languages.

While the researchers demonstrate the utility of the MELA dataset for probing syntax capabilities, it would be valuable to explore the model's performance on other linguistic phenomena, such as semantics and pragmatics, to gain a more holistic understanding of the model's linguistic competence.

Furthermore, the researchers' findings highlight the need for continued research and development in improving the multilingual capabilities of LLMs, as the open-source models still lag behind the more capable GPT-4o. Addressing this gap could have significant implications for the accessibility and inclusivity of language technology.

Conclusion

The researchers have presented the largest benchmark to date on linguistic acceptability, MELA, covering 10 diverse languages. Their findings demonstrate the strong multilingual capabilities of the GPT-4o model and the potential for cross-lingual transfer in acceptability judgments. The researchers' probing experiments also provide insights into how language models acquire syntax capabilities.

This work contributes to the growing body of research on evaluating the multilingual capabilities of LLMs and highlights the importance of continued development in scaling those capabilities. The publicly available MELA dataset and the researchers' insights can serve as a valuable resource for the broader research community.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MELA: Multilingual Evaluation of Linguistic Acceptability

Ziyin Zhang, Yikang Liu, Weifang Huang, Junyu Mao, Rui Wang, Hai Hu

In this work, we present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark, and investigate cross-lingual transfer in acceptability judgements with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntax capability acquisition. Our results show that GPT-4o exhibits a strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples lead to 23 MCC performance in a completely unrelated language -- Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks. Our data is available at https://github.com/sjtu-compling/MELA.

6/7/2024

METAL: Towards Multilingual Meta-Evaluation

Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram

With the rising human-like precision of Large Language Models (LLMs) in numerous tasks, their utilization in a variety of real-world applications is becoming more prevalent. Several studies have shown that LLMs excel on many standard NLP benchmarks. However, it is challenging to evaluate LLMs due to test dataset contamination and the limitations of traditional metrics. Since human evaluations are difficult to collect, there is a growing interest in the community to use LLMs themselves as reference-free evaluators for subjective metrics. However, past work has shown that LLM-based evaluators can exhibit bias and have poor alignment with human judgments. In this study, we propose a framework for an end-to-end assessment of LLMs as evaluators in multilingual scenarios. We create a carefully curated dataset, covering 10 languages containing native speaker judgments for the task of summarization. This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL). We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2. Our results indicate that LLM-based evaluators based on GPT-4 perform the best across languages, while GPT-3.5-Turbo performs poorly. Additionally, we perform an analysis of the reasoning provided by LLM-based evaluators and find that it often does not match the reasoning provided by human judges.

4/3/2024

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA models, GPT-4-Vision and Gemini-Pro-Vision. Our experiments show that larger models such as GPT-4, Gemini-Pro and PaLM2 outperform smaller models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on more datasets. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.

4/4/2024

The Model Arena for Cross-lingual Sentiment Analysis: A Comparative Study in the Era of Large Language Models

Xiliang Zhu, Shayna Gardiner, Tere Roldán, David Rossouw

Sentiment analysis serves as a pivotal component in Natural Language Processing (NLP). Advancements in multilingual pre-trained models such as XLM-R and mT5 have contributed to the increasing interest in cross-lingual sentiment analysis. The recent emergence in Large Language Models (LLM) has significantly advanced general NLP tasks, however, the capability of such LLMs in cross-lingual sentiment analysis has not been fully studied. This work undertakes an empirical analysis to compare the cross-lingual transfer capability of public Small Multilingual Language Models (SMLM) like XLM-R, against English-centric LLMs such as Llama-3, in the context of sentiment analysis across English, Spanish, French and Chinese. Our findings reveal that among public models, SMLMs exhibit superior zero-shot cross-lingual performance relative to LLMs. However, in few-shot cross-lingual settings, public LLMs demonstrate an enhanced adaptive potential. In addition, we observe that proprietary GPT-3.5 and GPT-4 lead in zero-shot cross-lingual capability, but are outpaced by public models in few-shot scenarios.

6/28/2024