Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices

Read original: arXiv:2408.09169 - Published 8/20/2024 by Patrícia Schmidtová, Saad Mahamood, Simone Balloccu, Ondřej Dušek, Albert Gatt, Dimitra Gkatzia, David M. Howcroft, Ondřej Plátek, Adarsa Sivaprasad

Overview

This paper provides a comprehensive survey of current practices in evaluating natural language generation (NLG) systems using automatic metrics.
The authors examine the strengths and limitations of popular NLG evaluation metrics, including BLEU, METEOR, ROUGE, and others.
They also discuss the role of human evaluation and how it compares to automatic metrics.

Plain English Explanation

The paper looks at how researchers use automatic metrics to measure the performance of AI systems that generate human-like text. These metrics are meant to assess the quality of generated text quickly and consistently, without relying on time-consuming human judgments.

Some of the most common automatic metrics include BLEU, METEOR, and ROUGE. These metrics work by comparing the AI-generated text to high-quality reference text written by humans. The more closely the AI's output matches the reference, the higher the metric score.
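
As a rough illustration of the reference-comparison idea, the Python sketch below computes a clipped n-gram precision, the core ingredient of BLEU-style scores. It is a simplified stand-in rather than the full BLEU formula (no brevity penalty, no averaging over n-gram orders), and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Each n-gram is credited at most as often as it occurs in the
    reference, so repeating a matching word does not inflate the score.
    Tokenization is plain whitespace splitting (an assumption).
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical example sentences
reference = "the cat sat on the mat"
close = "the cat sat on a mat"
far = "a dog ran in the park"

score_close = clipped_precision(close, reference)  # 5 of 6 unigrams match
score_far = clipped_precision(far, reference)      # 1 of 6 unigrams match
print(round(score_close, 3))  # 0.833
print(round(score_far, 3))    # 0.167
```

The output that shares more words with the reference scores higher, even though a score like this says nothing about whether a low-overlap sentence is actually wrong.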

The paper discusses the pros and cons of these automatic metrics. While they provide a fast and standardized way to evaluate NLG systems, they don't always align with human judgments of text quality. The authors also explore how biases in the automatic metrics can lead to misleading results.

Overall, the paper highlights the importance of carefully selecting and interpreting automatic metrics, and the continued need for human evaluation to fully assess the capabilities of natural language generation systems.

Technical Explanation

The paper begins by reviewing prior surveys on NLG evaluation, noting that automatic metrics have become increasingly prominent in recent years. The authors then provide a comprehensive overview of the most widely used automatic evaluation metrics for NLG:

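BLEU: Measures n-gram precision of the generated text against one or more reference texts, with a penalty for overly short output.
METEOR: Designed to address weaknesses of BLEU; it aligns words between the generated and reference text, also matching stems, synonyms, and paraphrases.
ROUGE: A family of recall-oriented metrics based on lexical overlap, including n-gram overlap (ROUGE-N) and longest common subsequence (ROUGE-L).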
Other metrics: The paper also covers perplexity, novel n-gram metrics, and metrics focused on semantic similarity and coherence.
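
The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and a plain F1 combination of precision and recall; the helper names are hypothetical, and published ROUGE implementations differ in tokenization, stemming, and how recall is weighted.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_f1(candidate, reference):
    """Simplified ROUGE-L-style score: F1 over LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the shared subsequence is "the cat sat on mat"
print(round(lcs_f1("the cat sat on a mat", "the cat sat on the mat"), 3))  # 0.833
```

Unlike fixed-length n-gram matching, the subsequence only requires matching words to appear in the same order, not next to each other.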

The authors discuss the strengths and limitations of each metric, highlighting how they can fail to capture important aspects of text quality, such as fluency, relevance, and semantics. They also examine the role of human evaluation and how it compares to automatic metrics, noting key divergences and the need for a balanced approach.
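
One common way to compare an automatic metric with human evaluation is to check how similarly the two rank the same outputs. The sketch below computes a Spearman rank correlation in plain Python. The scores and ratings are invented toy numbers, not results from the paper, and the function names are assumptions for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented): one metric score and one human rating per output
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1, 3, 4, 5]
print(round(spearman(metric_scores, human_ratings), 3))  # 0.8
```

A value near 1 means the metric orders outputs much as humans do; values near 0 suggest the metric is tracking something other than what the human raters judged.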

Finally, the paper covers emerging research directions, such as using large language models for evaluation, and calls for greater standardization and transparency in NLG benchmarking and metric selection.

Critical Analysis

The paper provides a thorough and well-researched overview of the current state of NLG evaluation practices. The authors do an excellent job of highlighting the strengths and limitations of automatic metrics, and the need to consider human evaluation in conjunction with these metrics.

One potential limitation of the paper is that it does not delve deeply into the specific biases and failure modes of the various automatic metrics. While the authors mention these issues, a more in-depth discussion could have provided greater insights for researchers and practitioners.

Additionally, the paper could have explored the implications of the findings for the broader field of natural language processing, such as the impact of flawed evaluation practices on model development, research priorities, and real-world applications.

Overall, this paper serves as a valuable resource for anyone working in the field of natural language generation, and it underscores the importance of carefully selecting and interpreting evaluation metrics to ensure the reliable and robust development of these important AI systems.

Conclusion

This comprehensive survey of NLG evaluation practices emphasizes the need for a balanced approach that considers both automatic metrics and human judgments. While automatic metrics provide a fast and standardized way to assess text quality, they have significant limitations and can fail to capture important aspects of language generation.

The paper's findings suggest that researchers and practitioners must exercise caution when using and interpreting these metrics, and remain mindful of their biases and failure modes. Ultimately, the authors call for greater standardization and transparency in NLG benchmarking and evaluation, to ensure the reliable and robust development of natural language generation systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices

Patrícia Schmidtová, Saad Mahamood, Simone Balloccu, Ondřej Dušek, Albert Gatt, Dimitra Gkatzia, David M. Howcroft, Ondřej Plátek, Adarsa Sivaprasad

Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.

8/20/2024

On the Evaluation of Machine-Generated Reports

James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision for automatic report generation, and -- critically -- a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of an information need, stating the necessary background, requirements, and scope of the report. Further, the generated reports should be complete, accurate, and verifiable. These qualities, which are desirable -- if not required -- in many analytic report-writing settings, require rethinking how to build and evaluate systems that exhibit these qualities. To foster new efforts in building these systems, we present an evaluation framework that draws on ideas found in various evaluations. To test completeness and accuracy, the framework uses nuggets of information, expressed as questions and answers, that need to be part of any high-quality generated report. Additionally, evaluation of citations that map claims made in the report to their source documents ensures verifiability.

5/13/2024

Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation

Zhaokun Jiang, Ziyin Zhang

Large language models have demonstrated parallel and even superior translation performance compared to neural machine translation (NMT) systems. However, existing comparative studies between them mainly rely on automated metrics, raising questions into the feasibility of these metrics and their alignment with human judgment. The present study investigates the convergences and divergences between automated metrics and human evaluation in assessing the quality of machine translation from ChatGPT and three NMT systems. To perform automatic assessment, four automated metrics are employed, while human evaluation incorporates the DQF-MQM error typology and six rubrics. Notably, automatic assessment and human evaluation converge in measuring formal fidelity (e.g., error rates), but diverge when evaluating semantic and pragmatic fidelity, with automated metrics failing to capture the improvement of ChatGPT's translation brought by prompt engineering. These results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools at the current stage.

4/24/2024

Evaluating Automatic Metrics with Incremental Machine Translation Systems

Guojun Wu, Shay B. Cohen, Rico Sennrich

We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions. Since human A/B testing is commonly used, we assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations. Our study confirms several previous findings in MT metrics research and demonstrates the dataset's value as a testbed for metric evaluation. We release our code at https://github.com/gjwubyron/Evo

7/4/2024