From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications

2404.07108

Published 4/12/2024 by Yongqiang Ma, Lizhi Qing, Jiawei Liu, Yangyang Kang, Yue Zhang, Wei Lu, Xiaozhong Liu, Qikai Cheng

cs.CL cs.IR

From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications

Abstract

Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed primarily for LLM development, yield numerical scores that ignore the user experience. Therefore, our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed Revision Distance,'' utilizes LLMs to suggest revision edits that mimic the human writing process. It is determined by counting the revision edits generated by LLMs. Benefiting from the generated revision edit details, our metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score. Our results show that for the easy-writing task, Revision Distance'' is consistent with established metrics (ROUGE, Bert-score, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in the context of challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Furthermore, our metric also holds significant potential for scenarios lacking reference texts.

Create account to get full access

Overview

This paper proposes using "revision distance" as a metric to evaluate the quality of text generated by large language models (LLMs) from a human-centric perspective.
Revision distance measures how much a human needs to edit the generated text to make it satisfactory, which the authors argue is a more meaningful metric than model-centric metrics like perplexity.
The paper explores applications of revision distance in various LLM-based tasks, including text generation, summarization, and question answering.

Plain English Explanation

The paper is about finding a better way to evaluate the performance of large language models (LLMs) - the powerful AI systems that can generate human-like text. Traditional metrics used to assess LLMs, like "perplexity," focus on how well the model predicts the next word in a sequence. However, the authors argue that these model-centric metrics don't necessarily reflect how useful the generated text is from a human user's perspective.

Instead, the researchers propose using a metric called "revision distance" to evaluate LLMs. Revision distance measures how much a person would need to edit or revise the text generated by an LLM to make it satisfactory. The idea is that the less a human needs to change the text, the better the LLM has performed.

The paper explores applying revision distance to various LLM-based applications, such as generating original text, summarizing long documents, and answering questions. The key advantage of this human-centric approach is that it aligns more closely with the real-world usefulness of the LLM's output, rather than just its technical prowess.

Technical Explanation

The paper introduces "revision distance" as a new metric for evaluating the performance of large language models (LLMs) in text generation tasks. Revision distance measures the editing effort required for a human to make the LLM-generated text satisfactory, which the authors argue is a more meaningful evaluation than traditional model-centric metrics like perplexity.

To calculate revision distance, the researchers have human annotators make edits to the LLM-generated text until it meets their quality standards. The number of edits required is then used as the revision distance score. The authors explore applying revision distance to various LLM-based applications, including text generation, summarization, and question answering.

The key insight is that revision distance captures the human user's experience with the LLM's output, rather than just the model's internal perfor-mance. This human-centric perspective is argued to be more relevant for real-world applications of LLMs, where the ultimate goal is to generate text that requires minimal editing by end-users.

Critical Analysis

The paper makes a compelling case for using revision distance as a complement to traditional evaluation metrics for LLMs. By focusing on the human user's experience, revision distance provides a more holistic and meaningful assessment of an LLM's performance.

However, the authors acknowledge some limitations of their approach. Collecting human annotations to calculate revision distance can be time-consuming and resource-intensive, especially at scale. Additionally, the subjective nature of what constitutes "satisfactory" text may introduce some variability in the revision distance scores.

It would also be valuable to further explore how revision distance relates to other model attributes, such as coherence, factual accuracy, and fluency. Understanding these relationships could help developers optimize LLMs for real-world usability.

Overall, the revision distance metric proposed in this paper represents an important step towards a more human-centered approach to evaluating large language models. As LLMs become increasingly prevalent in various applications, such user-focused evaluation methods will be crucial for ensuring they deliver meaningful value to end-users.

Conclusion

This paper introduces "revision distance" as a novel metric for evaluating the performance of large language models (LLMs) from a human-centric perspective. By measuring the editing effort required for a person to make the LLM-generated text satisfactory, revision distance provides a more meaningful assessment of the model's real-world usefulness compared to traditional model-centric metrics.

The authors demonstrate the application of revision distance across various LLM-based tasks, including text generation, summarization, and question answering. While the approach has some practical limitations, it represents an important step towards aligning LLM evaluation with the needs and experiences of human users. As LLMs become more widely deployed, such human-centered evaluation methods will be critical for ensuring these powerful AI systems deliver tangible benefits to end-users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

❗

Leveraging Human Revisions for Improving Text-to-Layout Models

Amber Xie, Chin-Yi Cheng, Forrest Huang, Yang Li

Learning from human feedback has shown success in aligning large, pretrained models with human values. Prior works have mostly focused on learning from high-level labels, such as preferences between pairs of model outputs. On the other hand, many domains could benefit from more involved, detailed feedback, such as revisions, explanations, and reasoning of human users. Our work proposes using nuanced feedback through the form of human revisions for stronger alignment. In this paper, we ask expert designers to fix layouts generated from a generative layout model that is pretrained on a large-scale dataset of mobile screens. Then, we train a reward model based on how human designers revise these generated layouts. With the learned reward model, we optimize our model with reinforcement learning from human feedback (RLHF). Our method, Revision-Aware Reward Models ($method$), allows a generative text-to-layout model to produce more modern, designer-aligned layouts, showing the potential for utilizing human revisions and stronger forms of feedback in improving generative models.

5/24/2024

cs.CL cs.AI

Reference-based Metrics Disprove Themselves in Question Generation

Bang Nguyen, Mengxia Yu, Yun Huang, Meng Jiang

Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collect another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntactic or semantic of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.

6/18/2024

cs.CL cs.AI cs.LG

Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts

Donya Rooein, Paul Rottger, Anastassia Shaitarova, Dirk Hovy

Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We, therefore, introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we create Prompt-based metrics as inputs for LLMs. They leverage LLM's general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.

6/7/2024

cs.CL

Textual Similarity as a Key Metric in Machine Translation Quality Estimation

Kun Sun, Rong Wang

Machine Translation (MT) Quality Estimation (QE) assesses translation reliability without reference texts. This study introduces textual similarity as a new metric for QE, using sentence transformers and cosine similarity to measure semantic closeness. Analyzing data from the MLQE-PE dataset, we found that textual similarity exhibits stronger correlations with human scores than traditional metrics (hter, model evaluation etc.). Employing GAMMs as a statistical tool, we demonstrated that textual similarity consistently outperforms other metrics across multiple language pairs in predicting human scores. We also found that hter actually failed to predict human scores in QE. Our findings highlight the effectiveness of textual similarity as a robust QE metric, recommending its integration with other metrics into QE frameworks and MT system training for improved accuracy and usability.

6/12/2024

cs.CL cs.AI