Revisiting Meta-evaluation for Grammatical Error Correction






Published 5/28/2024 by Masamune Kobayashi, Masato Mita, Mamoru Komachi


Metrics are the foundation for automatic evaluation in grammatical error correction (GEC), and the evaluation of the metrics themselves (meta-evaluation) relies on their correlation with human judgments. However, conventional meta-evaluations in English GEC encounter several challenges, including biases caused by inconsistencies in evaluation granularity and an outdated setup using classical systems. These problems can lead to misinterpretation of metrics and potentially hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings along two different granularities, edit-based and sentence-based, covering 12 state-of-the-art systems, including large language models (LLMs), and two human corrections with different focuses. Correlations improve when the granularity is aligned in the sentence-level meta-evaluation, which suggests that edit-based metrics may have been underestimated in existing studies. Furthermore, correlations of most metrics decrease when changing from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.



Overview

  • This paper revisits meta-evaluation for grammatical error correction (GEC), that is, how GEC evaluation metrics are themselves evaluated.
  • It examines the existing evaluation frameworks and proposes improvements to better assess the performance of GEC models.
  • The authors argue that current evaluation metrics may not accurately capture the nuances of GEC and can lead to misleading conclusions about model performance.
  • They introduce SEEDA, a new meta-evaluation dataset, along with evaluation methodologies intended to address the shortcomings of the existing approaches.

Plain English Explanation

Grammatical error correction (GEC) is a task in natural language processing where the goal is to automatically identify and fix errors in written text, such as incorrect grammar, spelling, or word usage. Evaluating the performance of GEC models is crucial for measuring their progress and comparing them to one another.

However, the authors of this paper argue that the current evaluation methods for GEC may not be adequate. They claim that the existing metrics and methodologies can sometimes lead to misleading conclusions about the true capabilities of GEC models. For example, a model that performs well on certain types of errors may be scored highly, even if it performs poorly on other types of errors that are equally important.

To address these issues, the authors propose new evaluation approaches that aim to provide a more comprehensive and nuanced assessment of GEC models. They introduce new metrics and methodologies that can better capture the various aspects of GEC performance, such as the ability to handle different types of errors, the level of fluency in the corrected text, and the overall quality of the corrections.

By revisiting the meta-evaluation of GEC systems, the authors hope to help researchers and practitioners develop more robust and reliable GEC models that better meet the needs of real-world applications, including GEC systems built on large language models such as GPT-3.

Technical Explanation

The paper presents a comprehensive review of the existing evaluation frameworks for grammatical error correction (GEC) systems, and proposes several improvements to address their limitations.

The authors first discuss the current state of meta-evaluation for GEC, highlighting the shortcomings of existing metrics like F-score, which may not adequately capture the nuances of GEC performance. They argue that these metrics can lead to misleading conclusions about the true capabilities of GEC models.
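As the abstract notes, meta-evaluation judges a metric by how well its scores correlate with human judgments. The following is a minimal Python sketch of that idea at the system level, not the authors' implementation. The function names and the example scores are hypothetical, and Pearson and Spearman correlation are used here only as common choices for this kind of comparison.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 (smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical scores for four GEC systems (illustrative values only).
    metric_scores = [0.52, 0.61, 0.58, 0.70]  # scores from an automatic metric
    human_scores = [0.10, 0.35, 0.40, 0.80]   # scores from human ratings
    print("Pearson:", pearson(metric_scores, human_scores))
    print("Spearman:", spearman(metric_scores, human_scores))
```

Under this setup, a metric that ranks systems in the same order as the human raters gets a Spearman correlation close to 1, while a metric whose ranking disagrees with the human ranking scores lower.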

To overcome these issues, the authors introduce new evaluation metrics and methodologies. These include:

  1. Error-specific Evaluation: Assessing the performance of GEC models on different types of errors (e.g., spelling, grammar, word choice) separately, rather than using a single aggregate score.
  2. Fluency-based Evaluation: Evaluating the fluency and naturalness of the corrected text, in addition to the accuracy of the corrections.
  3. Human Evaluation: Incorporating human judgments to assess the overall quality and usefulness of the GEC output.
  4. Multilingual Evaluation: Extending the evaluation to cover multiple languages, as proposed in the METAL framework.
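Error-specific evaluation (item 1 in the list above) can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the paper's method or any toolkit's API: edits are represented as hypothetical (start, end, replacement, error_type) tuples, the error-type labels are placeholders, F0.5 is assumed as the F-score variant, and an undefined precision or recall is treated as 0.0 by convention of this sketch.

```python
from collections import defaultdict


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from edit counts; beta=0.5 weights precision above recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def per_type_f_beta(hyp_edits, ref_edits, beta=0.5):
    """Score system edits against reference edits separately per error type.

    Each edit is a (start, end, replacement, error_type) tuple. Two edits
    match when their span and replacement are identical. Matched edits and
    false positives are counted under the system edit's type; missed
    reference edits are counted under the reference edit's type.
    """
    hyp_keys = {edit[:3] for edit in hyp_edits}
    ref_keys = {edit[:3] for edit in ref_edits}
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

    for start, end, replacement, error_type in hyp_edits:
        if (start, end, replacement) in ref_keys:
            counts[error_type]["tp"] += 1
        else:
            counts[error_type]["fp"] += 1

    for start, end, replacement, error_type in ref_edits:
        if (start, end, replacement) not in hyp_keys:
            counts[error_type]["fn"] += 1

    return {
        error_type: f_beta(c["tp"], c["fp"], c["fn"], beta)
        for error_type, c in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical edits for one sentence (token spans and labels are made up).
    system_edits = [(1, 2, "goes", "VERB"), (4, 5, "the", "DET")]
    reference_edits = [(1, 2, "goes", "VERB"), (4, 5, "a", "DET")]
    for error_type, score in per_type_f_beta(system_edits, reference_edits).items():
        print(error_type, round(score, 3))
```

Reporting one score per error type makes it visible when a system does well on one category, such as verb errors in the example, while failing on another, which a single aggregate score would hide.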

The authors also discuss the challenges of detecting and correcting error structures within GEC and how their proposed evaluation approaches can help address these challenges.

Throughout the paper, the authors draw on examples and insights from the broader GEC literature, including recent advancements in large language models and GPT-3-based GEC systems, to contextualize their proposed improvements to meta-evaluation.

Critical Analysis

The authors make a compelling case for the need to revisit the meta-evaluation of GEC systems. Their critique of the current evaluation frameworks is well-supported, and the proposed improvements offer a more comprehensive and nuanced approach to assessing GEC performance.

One potential limitation of the paper is that the authors do not provide a detailed implementation or empirical validation of their proposed evaluation methods. While they discuss the conceptual advantages of their approach, more empirical evidence would be helpful to demonstrate the practical benefits and viability of these methods.

Additionally, the authors could have explored the potential challenges and tradeoffs associated with their proposed evaluation framework, such as the increased complexity of human evaluation or the difficulties in standardizing error-specific assessments across different GEC tasks and datasets.

Despite these minor limitations, the paper makes a valuable contribution to the GEC research community by highlighting the shortcomings of existing evaluation approaches and outlining a path forward for more robust and reliable meta-evaluation of GEC systems.

Conclusion


This paper presents a thoughtful analysis of the current state of meta-evaluation for grammatical error correction (GEC) systems and proposes several improvements to address the limitations of existing evaluation frameworks.

By introducing new evaluation metrics and methodologies, such as error-specific assessment, fluency-based evaluation, and multilingual approaches, the authors aim to provide a more comprehensive and nuanced way to assess the performance of GEC models. These advancements have the potential to drive further progress in the field of GEC, as researchers and practitioners can better evaluate and compare the capabilities of different models.

The authors' work also underscores the importance of rigorous and meaningful evaluation in the development of natural language processing technologies, as misleading performance metrics can hinder the advancement of the field. By revisiting the meta-evaluation of GEC, this paper lays the groundwork for more robust and reliable assessment of these critical language-based systems.


Related Papers

CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction

Jingheng Ye, Zishan Xu, Yinghui Li, Xuxin Cheng, Linlin Song, Qingyu Zhou, Hai-Tao Zheng, Ying Shen, Xin Su





The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics, which has received little attention in previous studies. To bridge the gap, we propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems, namely hit-correction, error-correction, under-correction, and over-correction. They collectively contribute to revealing the critical characteristics and locating drawbacks of GEC systems. Evaluating systems by combining these dimensions leads to high human consistency over other reference-based and reference-less metrics. Extensive experiments on 2 human judgement datasets and 6 reference datasets demonstrate the effectiveness and robustness of our method. All the code will be released after the peer review.




Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models

Kostiantyn Omelianchuk, Andrii Liubonko, Oleksandr Skurzhanskyi, Artem Chernodub, Oleksandr Korniienko, Igor Samokhin





In this paper, we carry out experimental research on Grammatical Error Correction, delving into the nuances of single-model systems, comparing the efficiency of ensembling and ranking methods, and exploring the application of large language models to GEC as single-model systems, as parts of ensembles, and as ranking methods. We set new state-of-the-art performance with F_0.5 scores of 72.8 on CoNLL-2014-test and 81.4 on BEA-test, respectively. To support further advancements in GEC and ensure the reproducibility of our research, we make our code, trained models, and systems' outputs publicly available.



Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction

Masamune Kobayashi, Masato Mita, Mamoru Komachi





Large Language Models (LLMs) have been reported to outperform existing automatic evaluation metrics in some tasks, such as text summarization and machine translation. However, there has been a lack of research on LLMs as evaluators in grammatical error correction (GEC). In this study, we investigate the performance of LLMs in GEC evaluation by employing prompts designed to incorporate various evaluation criteria inspired by previous research. Our extensive experimental results demonstrate that GPT-4 achieved Kendall's rank correlation of 0.662 with human judgments, surpassing all existing methods. Furthermore, in recent GEC evaluations, we have underscored the significance of LLM scale and particularly emphasized the importance of fluency among evaluation criteria.




GPT-3.5 for Grammatical Error Correction

Anisia Katinskaia, Roman Yangarber





This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality with language models (LMs), the Scribendi test, and comparing the semantic embeddings of sentences. GPT-3.5 has a known tendency to over-correct erroneous sentences and propose alternative corrections. For several languages, such as Czech, German, Russian, Spanish, and Ukrainian, GPT-3.5 substantially alters the source sentences, including their semantics, which presents significant challenges for evaluation with reference-based metrics. For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics. However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types, including punctuation mistakes, tense errors, syntactic dependencies between words, and lexical compatibility at the sentence level.

