ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment

Read original: arXiv:2305.14463 - Published 6/11/2024 by Tarek Naous, Michael J. Ryan, Anton Lavrouk, Mohit Chandra, Wei Xu

Overview

Comprehensive evaluation of large language models for multilingual readability assessment
Introduces ReadMe++, a new multilingual, multi-domain dataset with human-annotated sentence readability
Benchmarks multilingual and monolingual language models in supervised, unsupervised, and few-shot prompting settings
Explores domain generalization and cross-lingual transfer capabilities of models trained on ReadMe++

Plain English Explanation

This research paper presents a comprehensive evaluation of how well large language models, which are powerful AI systems trained on vast amounts of text, can assess the readability of sentences in multiple languages. Existing resources for evaluating readability often lack diversity in the types of content and languages covered, which limits the ability to analyze how well these models perform across different domains and languages.

To address this, the researchers introduce a new dataset called ReadMe++ that includes 9,757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different sources. Humans have annotated these sentences with readability scores, providing a benchmark to test the performance of language models.

Using the ReadMe++ dataset, the researchers evaluate how well multilingual and single-language models perform in three different settings: supervised (where the model is trained on labeled data), unsupervised (where the model tries to assess readability without specific training), and few-shot prompting (where the model is given just a few examples to learn from). The diversity of content and languages in ReadMe++ allows the researchers to explore how well these models can generalize to new domains and transfer knowledge across languages.

The results show that models trained on ReadMe++ generalize better to new domains and transfer better across languages. However, the researchers also identify limitations in the state-of-the-art unsupervised methods for readability assessment. Overall, this work provides a valuable new benchmark and insights to support the development of more robust multilingual readability assessment systems.

Technical Explanation

The paper presents a comprehensive evaluation of large language models for multilingual readability assessment, using a new dataset called ReadMe++ that the researchers introduce. Existing evaluation resources, such as An Open Multilingual System for Scoring Readability of Wikipedia, MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain, Tagengo: A Multilingual Chat Dataset, and MELA: A Multilingual Evaluation of Linguistic Acceptability, often lack domain and language diversity, which limits cross-domain and cross-lingual analyses.

ReadMe++ includes 9,757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources, with human annotations of readability. This benchmark is designed to encourage research on developing robust multilingual readability assessment methods.
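
A benchmark with human readability ratings is typically used by comparing a model's predicted scores against those ratings. As a rough illustration, the sketch below computes a Pearson correlation in plain Python. The choice of metric, the rating values, and the function name `pearson` are assumptions made for this example; this summary does not state which metric the paper reports.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical human ratings and model predictions for five sentences.
human_ratings = [1, 2, 3, 5, 6]
model_scores = [1.2, 1.9, 3.4, 4.8, 5.5]
print(round(pearson(human_ratings, model_scores), 3))
```

A value close to 1 would mean the model ranks and spaces sentences much as the annotators did; a value near 0 would mean its scores carry little information about the human ratings.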

Using ReadMe++, the researchers benchmark multilingual and monolingual language models in three settings: supervised, unsupervised, and few-shot prompting. The domain and language diversity in ReadMe++ enables them to test more effective few-shot prompting and identify shortcomings in state-of-the-art unsupervised methods for readability assessment.
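
To make the few-shot prompting setting more concrete, here is a minimal sketch of how a prompt might be assembled from a handful of rated example sentences followed by one unrated target. The 1 to 6 rating scale, the prompt wording, and the function name `build_few_shot_prompt` are assumptions for illustration only and are not taken from the paper.

```python
def build_few_shot_prompt(examples, target_sentence, scale=(1, 6)):
    """Build a text prompt from (sentence, rating) examples plus one unrated sentence."""
    low, high = scale
    lines = [
        f"Rate the readability of each sentence from {low} (easiest) to {high} (hardest).",
        "",
    ]
    for sentence, rating in examples:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside the {low}-{high} scale")
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Rating: {rating}")
        lines.append("")
    # The target sentence ends with an open "Rating:" for the model to complete.
    lines.append(f"Sentence: {target_sentence}")
    lines.append("Rating:")
    return "\n".join(lines)


# Hypothetical examples; real few-shot examples would come from annotated data.
examples = [
    ("The cat sat on the mat.", 1),
    ("The statute supersedes any prior agreement between the parties.", 5),
]
print(build_few_shot_prompt(examples, "Water boils at one hundred degrees."))
```

In a multi-domain, multilingual dataset, the examples could be drawn from the same domain or language as the target sentence, which is one way that diversity can inform prompt design.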

The experiments also reveal exciting results, including superior domain generalization and enhanced cross-lingual transfer capabilities of models trained on ReadMe++. The researchers will make the ReadMe++ dataset publicly available and release a Python package tool for multilingual sentence readability prediction using their trained models.
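
One generic way to probe domain generalization is to hold out every sentence from one domain, train on the rest, and test on the held-out domain. The sketch below shows that split only; it is not the authors' experimental protocol. The record fields (`domain`, `text`, `rating`), the domain names, and the function name `leave_one_domain_out` are hypothetical.

```python
from collections import defaultdict


def leave_one_domain_out(records):
    """Yield (held_out_domain, train, test) splits, one per domain.

    Each record is a dict with at least a "domain" key.
    """
    by_domain = defaultdict(list)
    for record in records:
        by_domain[record["domain"]].append(record)
    for held_out in sorted(by_domain):
        test = by_domain[held_out]
        # Training data is every record whose domain differs from the held-out one.
        train = [r for d, rs in by_domain.items() if d != held_out for r in rs]
        yield held_out, train, test


# Hypothetical records; domain names and ratings are made up for the example.
records = [
    {"domain": "news", "text": "Markets rose on Monday.", "rating": 2},
    {"domain": "news", "text": "The minister resigned.", "rating": 2},
    {"domain": "legal", "text": "The lessee shall indemnify the lessor.", "rating": 5},
    {"domain": "wiki", "text": "Paris is the capital of France.", "rating": 1},
]
for held_out, train, test in leave_one_domain_out(records):
    print(held_out, len(train), len(test))
```

A model that scores well on each held-out domain without having seen it during training would be showing the kind of generalization described above.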

Critical Analysis

The researchers acknowledge that while ReadMe++ provides a valuable benchmark for evaluating multilingual readability assessment, it still has some limitations. The dataset may not capture the full complexity of real-world readability challenges, as the sentences were selected and annotated by humans rather than representing a random sample of natural language. Additionally, the researchers only evaluate a subset of large language models, and there may be other approaches or architectures that could perform even better on this task.

One area for further research would be to explore the performance of large language models in more fine-grained readability analysis, such as identifying specific linguistic features that contribute to a sentence's ease of understanding. The researchers also note that their few-shot prompting experiments could be expanded to investigate the optimal number of examples and the most effective prompting strategies for this task.

Overall, this work represents a significant contribution to the field of multilingual readability assessment, providing a new benchmark dataset and insights into the capabilities and limitations of current large language models. By encouraging further research and development in this area, the researchers aim to support the creation of more robust and accessible multilingual language technologies.

Conclusion

This paper presents a comprehensive evaluation of large language models for multilingual readability assessment, introducing a new dataset called ReadMe++ that addresses limitations in existing resources. The researchers benchmark models in supervised, unsupervised, and few-shot prompting settings, revealing exciting capabilities in domain generalization and cross-lingual transfer, as well as shortcomings in state-of-the-art unsupervised methods.

By making the ReadMe++ dataset publicly available and releasing a Python package for multilingual sentence readability prediction, the researchers hope to encourage further research and development in this important area. Improving the ability of language models to accurately assess readability across diverse domains and languages has the potential to enhance the accessibility and inclusivity of a wide range of digital technologies and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment

Tarek Naous, Michael J. Ryan, Anton Lavrouk, Mohit Chandra, Wei Xu

We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability for cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on developing robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in the supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enable us to test more effective few-shot prompting, and identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal exciting results of superior domain generalization and enhanced cross-lingual transfer capabilities by models trained on ReadMe++. We will make our data publicly available and release a python package tool for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme

6/11/2024

An Open Multilingual System for Scoring Readability of Wikipedia

Mykola Trokhymovych, Indira Sen, Martin Gerlach

With over 60M articles, Wikipedia has become the largest platform for open and freely accessible knowledge. While it has more than 15B monthly visits, its content is believed to be inaccessible to many readers due to the lack of readability of its text. However, previous investigations of the readability of Wikipedia have been restricted to English only, and there are currently no systems supporting the automatic readability assessment of the 300+ languages in Wikipedia. To bridge this gap, we develop a multilingual model to score the readability of Wikipedia articles. To train and evaluate this model, we create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children encyclopedias. We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages and improving upon previous benchmarks. These results demonstrate the applicability of the model at scale for languages in which there is no ground-truth data available for model fine-tuning. Furthermore, we provide the first overview on the state of readability in Wikipedia beyond English.

6/5/2024

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.

7/26/2024

MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain

Chao Jiang, Wei Xu

Medical texts are notoriously challenging to read. Properly measuring their readability is the first step towards making them more accessible. In this paper, we present a systematic study on fine-grained readability measurements in the medical domain at both sentence-level and span-level. We introduce a new dataset MedReadMe, which consists of manually annotated readability ratings and fine-grained complex span annotation for 4,520 sentences, featuring two novel Google-Easy and Google-Hard categories. It supports our quantitative analysis, which covers 650 linguistic features and automatic complex word and jargon identification. Enabled by our high-quality annotation, we benchmark and improve several state-of-the-art sentence-level readability metrics for the medical domain specifically, which include unsupervised, supervised, and prompting-based methods using recently developed large language models (LLMs). Informed by our fine-grained complex span annotation, we find that adding a single feature, capturing the number of jargon spans, into existing readability formulas can significantly improve their correlation with human judgments. We will publicly release the dataset and code.

5/6/2024