Using Natural Language Explanations to Rescale Human Judgments

Read original: arXiv:2305.14770 - Published 9/10/2024 by Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett

Overview

The paper addresses the need for high-quality human-labeled data, particularly for processes like human feedback and evaluation, as large language models (LLMs) become more prevalent.
It proposes a method to rescale ordinal annotations and explanations using LLMs to better capture the nuances in annotators' judgments.
The technique is evaluated on ratings of system outputs for a document-grounded question answering task, a task on which LLMs achieve near-human performance.

Plain English Explanation

As large language models (LLMs) become more widely used, the need grows for high-quality human-labeled data, for example to evaluate these models or to collect human feedback. A common practice is to have several people (annotators) rate each example and then take the consensus of their judgments as the final label.
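As a point of reference, consensus annotation can be sketched in a few lines. The source does not say which aggregation rule is used, so the majority vote and the tie-breaking rule below are assumptions made purely for illustration, and the ratings are invented.

```python
from collections import Counter
from statistics import mean


def consensus_label(ratings):
    """Return the most frequent rating.

    Assumption (not from the paper): ties are broken by choosing the
    tied rating closest to the mean, then the smaller rating.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    tied = [rating for rating, count in counts.items() if count == top]
    average = mean(ratings)
    return min(tied, key=lambda rating: (abs(rating - average), rating))


# Invented example: three annotators rate one answer on a 1-5 scale.
print(consensus_label([4, 4, 2]))
```

Note that this kind of aggregation looks only at the labels. Whatever reasoning led the third annotator to a lower rating is discarded, which is the gap the paper's method targets.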

However, the paper notes that annotators' judgments can differ in more than one way. Annotators may reach different qualitative assessments of the same example, and they may map the same assessment onto the labeling scheme differently. The paper proposes to capture these nuances by using both the annotators' Likert ratings (ratings on an ordinal scale) and their written explanations for those ratings.

The key idea is to feed the ratings and explanations into an LLM, which then produces a numeric score that reflects the underlying assessment of the example. This score is anchored in a scoring rubric that can be designed or modified after the initial annotation process. This allows the rubric to include distinctions that may not have been known when the original error taxonomy was created.

The paper explores this technique in the context of rating the outputs of a document-grounded question answering system, where LLMs have achieved near-human performance. The method helps rescale the raw judgments without impacting agreement and brings the scores closer to human judgments based on the same scoring rubric.

Technical Explanation

The paper proposes a method to leverage the explanations provided by human annotators along with their ordinal ratings (e.g., Likert scales) to produce numeric scores that better capture the nuances of their judgments.

The key steps are:

1. Annotators provide Likert ratings and corresponding natural language explanations for a set of examples.
2. The ratings and explanations are fed into an LLM, which is then prompted to produce a numeric score anchored in a scoring rubric.
3. The numeric scores should reflect the annotators' underlying assessments of the examples, even if the original labeling scheme did not capture all the relevant distinctions.
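The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the rubric text, the 0-100 range, the prompt wording, the `Score:` output format, and the `call_llm` callback are all hypothetical, and the LLM is replaced by a stub so the example runs offline.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    likert_label: str   # the annotator's ordinal rating
    explanation: str    # the annotator's free-text justification


# Hypothetical rubric, written after annotation; not taken from the paper.
RUBRIC = (
    "Start from 100 points.\n"
    "Deduct points for missing information, in proportion to its importance.\n"
    "Deduct points for irrelevant or unsupported content."
)


def build_prompt(annotation, rubric):
    """Combine the rubric, rating, and explanation into one prompt."""
    return (
        f"Scoring rubric:\n{rubric}\n\n"
        f"Annotator rating: {annotation.likert_label}\n"
        f"Annotator explanation: {annotation.explanation}\n\n"
        "Apply the rubric to the annotator's judgment and finish with a "
        "line of the form 'Score: <number from 0 to 100>'."
    )


def parse_score(llm_output, low=0.0, high=100.0):
    """Extract the numeric score, clamped to [low, high]; None if absent."""
    match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", llm_output)
    if match is None:
        return None
    return min(max(float(match.group(1)), low), high)


def rescale(annotation, rubric, call_llm):
    """Rescale one annotation; call_llm is any function str -> str."""
    return parse_score(call_llm(build_prompt(annotation, rubric)))


def fake_llm(prompt):
    # Stand-in for a real model call, used only to make the sketch runnable.
    return "The explanation notes one omitted detail.\nScore: 70"


example = Annotation(
    likert_label="Missing minor information",
    explanation="The answer omits the date mentioned in the article.",
)
print(rescale(example, RUBRIC, fake_llm))
```

Keeping the rubric as a plain argument reflects the point made in the summary that the rubric can be designed or changed after annotation: the stored ratings and explanations stay fixed, and only the prompt changes.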

The authors evaluate this approach on ratings of outputs from a document-grounded question answering system, a task on which LLMs achieve near-human performance. They report that the method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
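To make the two claims concrete, one way to check them is to compare a rank-agreement statistic before and after rescaling, and to measure the distance to reference scores assigned by people using the rubric. The summary does not state which metrics the authors use, so Kendall's tau-a and mean absolute error below are assumptions, and all numbers are invented for illustration.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


def mean_abs_error(scores, reference):
    """Average absolute distance between scores and reference scores."""
    return sum(abs(s - r) for s, r in zip(scores, reference)) / len(scores)


# Invented data: two annotators, four examples, scores on a 0-100 scale.
raw_a, raw_b = [100, 75, 75, 25], [100, 75, 50, 50]            # Likert mapped linearly
rescaled_a, rescaled_b = [100, 85, 70, 30], [100, 80, 60, 40]  # after rescaling
reference = [100, 85, 65, 35]                                  # rubric-based human scores

print("agreement before:", kendall_tau_a(raw_a, raw_b))
print("agreement after: ", kendall_tau_a(rescaled_a, rescaled_b))
print("error before:", mean_abs_error(raw_a, reference))
print("error after: ", mean_abs_error(rescaled_a, reference))
```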

Critical Analysis

The paper presents a novel approach that uses natural language explanations to improve the quality of human-labeled data. Such data is in growing demand as LLMs become more widely used.

One potential limitation is the reliance on the LLM's ability to accurately interpret the annotators' explanations and produce meaningful numeric scores. The paper does not provide a detailed analysis of the LLM's performance in this task, and it would be important to understand the potential sources of error or bias in the LLM's scoring.

Additionally, the paper focuses on a specific use case of document-grounded question answering, and it's unclear how the method would generalize to other types of subjective tasks or labeling schemes. Further research would be needed to understand the broader applicability of this approach.

It would also be interesting to explore the use of Bayesian statistical modeling to incorporate the annotators' explanations, which may provide a more principled way to capture the underlying uncertainty and nuances in their judgments.

Conclusion

The paper presents a promising approach to leveraging natural language explanations to improve the quality of human-labeled data, which is a critical need as LLMs become more widely used. By using LLMs to rescale ordinal annotations and explanations, the method can capture the nuances in annotators' judgments and bring the scores closer to human judgments based on a common scoring rubric.

This work has the potential to enhance the reliability and usefulness of human-labeled data, which is essential for tasks like model evaluation, feedback, and training. Further research is needed to fully understand the strengths and limitations of this approach, but the paper represents an important step forward in addressing a key challenge in the age of large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Using Natural Language Explanations to Rescale Human Judgments

Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett

The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over human judgments. However, annotators' judgments for subjective tasks can differ in many ways: they may reflect different qualitative judgments about an example, and they may be mapped to a labeling scheme in different ways. We show that these nuances can be captured by natural language explanations, and propose a method to rescale ordinal annotations and explanations using LLMs. Specifically, we feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. These scores should reflect the annotators' underlying assessments of the example. The rubric can be designed or modified after annotation, and include distinctions that may not have been known when the original error taxonomy was devised. We explore our technique in the context of rating system outputs for a document-grounded question answering task, where LLMs achieve near-human performance. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.

9/10/2024

Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from LLMs

Puxuan Yu, Daniel Cohen, Hemank Lamba, Joel Tetreault, Alex Jaimes

In search settings, calibrating the scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system's usefulness and trustworthiness for downstream users. While previous research has improved this notion of calibration for low complexity learning-to-rank models, the larger data demands and parameter count specific to modern neural text rankers produce unique obstacles that hamper the efficacy of methods intended for the learning-to-rank setting. This paper proposes exploiting large language models (LLMs) to provide relevance and uncertainty signals for these neural text rankers to produce scale-calibrated scores through Monte Carlo sampling of natural language explanations (NLEs). Our approach transforms the neural ranking task from ranking textual query-document pairs to ranking corresponding synthesized NLEs. Comprehensive experiments on two popular document ranking datasets show that the NLE-based calibration approach consistently outperforms past calibration methods and LLM-based methods for ranking, calibration, and query performance prediction tasks.

8/28/2024

AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators

Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen

Many natural language processing (NLP) tasks rely on labeled data to train machine learning models with high performance. However, data annotation is time-consuming and expensive, especially when the task involves a large amount of data or requires specialized domains. Recently, GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks. In this paper, we first claim that large language models (LLMs), such as GPT-3.5, can serve as an excellent crowdsourced annotator when provided with sufficient guidance and demonstrated examples. Accordingly, we propose AnnoLLM, an annotation system powered by LLMs, which adopts a two-step approach, explain-then-annotate. Concretely, we first prompt LLMs to provide explanations for why the specific ground truth answer/label was assigned for a given example. Then, we construct the few-shot chain-of-thought prompt with the self-generated explanation and employ it to annotate the unlabeled data with LLMs. Our experiment results on three tasks, including user input and keyword relevance assessment, BoolQ, and WiC, demonstrate that AnnoLLM surpasses or performs on par with crowdsourced annotators. Furthermore, we build the first conversation-based information retrieval dataset employing AnnoLLM. This dataset is designed to facilitate the development of retrieval models capable of retrieving pertinent documents for conversational text. Human evaluation has validated the dataset's high quality.

4/8/2024

Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Xuansheng Wu, Padmaja Pravin Saraf, Gyeong-Geon Lee, Ehsan Latif, Ninghao Liu, Xiaoming Zhai

Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or if it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs used to score students' written responses to science tasks and their alignment with human scores. We also examine whether enhancing the alignments can improve scoring accuracy. Specifically, we prompt LLMs to generate analytic rubrics that they use to assign scores and study the alignment gap with human grading rubrics. Based on a series of experiments with various configurations of LLM settings, we reveal a notable alignment gap between human and LLM graders. While LLMs can adapt quickly to scoring tasks, they often resort to shortcuts, bypassing deeper logical reasoning expected in human grading. We found that incorporating high-quality analytical rubrics designed to reflect human grading logic can mitigate this gap and enhance LLMs' scoring accuracy. These results caution against the simplistic application of LLMs in science education and highlight the importance of aligning LLM outputs with human expectations to ensure efficient and accurate automatic scoring.

7/29/2024