Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

2404.08008

Published 4/15/2024 by Kehua Feng, Keyan Ding, Kede Ma, Zhihua Wang, Qiang Zhang, Huajun Chen

Abstract

The past years have witnessed a proliferation of large language models (LLMs). Yet, automated and unbiased evaluation of LLMs is challenging due to the inaccuracy of standard metrics in reflecting human preferences and the inefficiency in sampling informative and diverse test examples. While human evaluation remains the gold standard, it is expensive and time-consuming, especially when dealing with a large number of testing samples. To address this problem, we propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition. MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs, whose responses are subject to three-alternative forced choice by human subjects. The pairwise comparison results are then aggregated into a global ranking using the Elo rating system. We select eight representative LLMs and compare them in terms of four skills: knowledge understanding, mathematical reasoning, writing, and coding. Experimental results show that the proposed method achieves a reliable and sensible ranking of LLMs' capabilities, identifies their relative strengths and weaknesses, and offers valuable insights for further LLM advancement.

Overview

This paper introduces a sample-efficient approach to evaluating large language models (LLMs) with human judgments.
The proposed method, MAximum Discrepancy (MAD) competition, automatically selects a small set of informative and diverse instructions on which the responses of two LLMs differ the most.
Human subjects compare the paired responses in a three-alternative forced choice, and the pairwise results are aggregated into a global ranking with the Elo rating system, so a reliable ranking is obtained with far fewer human assessments than evaluating every example.

Plain English Explanation

The paper presents a new way to evaluate the performance of large language models (LLMs) - complex AI systems that can generate human-like text. Evaluating LLMs is challenging because they can produce a vast number of outputs, making it impractical to have humans assess each one.

The researchers' approach, MAximum Discrepancy (MAD) competition, focuses on finding the instructions for which two LLMs give the most different responses. By having humans judge only these "most discrepant" cases, the researchers show they can rank models reliably using far fewer human assessments than evaluation methods that judge every example.

The key idea is to pit two LLMs against each other on instructions chosen to maximize the difference between their responses. When two models answer the same instruction very differently, a human judgment on that pair is likely to be especially informative about which model is stronger. Human subjects compare the two responses in a three-alternative forced choice, and the results of these pairwise comparisons are combined into an overall ranking using the Elo rating system.

Technical Explanation

The paper introduces MAximum Discrepancy (MAD) competition for human evaluation of large language models (LLMs). The core idea is to identify, for each pair of LLMs, the instructions on which their responses differ the most, and then to spend the human evaluation budget on those "most discrepant" examples.

The MAD process works roughly as follows:

Prompt generation: The researchers assemble a pool of diverse instructions (e.g., open-ended questions, writing tasks) that can elicit a wide range of responses.
Response collection: For each instruction, the LLMs under comparison generate responses. The comparison is between models, not against human-written references.
Discrepancy scoring: For each pair of LLMs, a discrepancy score measures how different the two models' responses are for each instruction.
Maximum discrepancy selection: For each pair, the instructions with the highest discrepancy scores are selected, keeping the selected set small and diverse.
Human evaluation: Human subjects compare the two responses to each selected instruction in a three-alternative forced choice.
Ranking: The pairwise comparison results are aggregated into a global ranking using the Elo rating system.
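The selection step can be illustrated with a minimal sketch. It assumes the responses of two models are already collected in dictionaries keyed by instruction. The discrepancy measure here (one minus the Jaccard overlap of word sets) is only a stand-in chosen for illustration, since this summary does not say how the paper scores discrepancy, and the function names are hypothetical. The sketch also ignores the diversity requirement mentioned in the abstract.

```python
from typing import Callable, Dict, List, Tuple


def jaccard_discrepancy(a: str, b: str) -> float:
    """Stand-in discrepancy (an assumption, not the paper's measure):
    1 minus the Jaccard overlap of the lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def select_max_discrepancy(
    responses_a: Dict[str, str],
    responses_b: Dict[str, str],
    k: int,
    discrepancy: Callable[[str, str], float] = jaccard_discrepancy,
) -> List[Tuple[str, float]]:
    """Return the k instructions on which the two models' responses differ most."""
    shared = responses_a.keys() & responses_b.keys()
    scored = [
        (inst, discrepancy(responses_a[inst], responses_b[inst]))
        for inst in shared
    ]
    # Highest discrepancy first; ties broken by instruction text for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

Only the instructions returned by `select_max_discrepancy` would be shown to human subjects for that model pair. Any other pairwise text distance could be passed in through the `discrepancy` argument.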

The key insight is that by focusing the human evaluation on the most discrepant examples, the researchers can achieve high-quality assessments with significantly fewer human judgments compared to traditional evaluation methods that assess the LLM's performance across a broader set of examples.

The paper demonstrates the effectiveness of MAD competition through experiments on eight representative LLMs across four skills (knowledge understanding, mathematical reasoning, writing, and coding), showing that it can provide reliable evaluations using 10-50 times fewer human assessments than alternative approaches.
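The abstract states that the pairwise comparison results are aggregated into a global ranking using the Elo rating system. A minimal sketch of that aggregation follows, using the standard Elo update. The K-factor of 32, the initial rating of 1000, and the mapping of the three-alternative outcomes to scores of 1, 0.5, and 0 (treating the third alternative as a tie) are assumptions made for illustration, not values taken from the paper.

```python
from typing import Dict, Iterable, Tuple

# Outcome labels are hypothetical: "a" means the first model's response was
# preferred, "b" the second, and "tie" is the assumed third alternative.
SCORE_FOR_A = {"a": 1.0, "tie": 0.5, "b": 0.0}


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_ratings(
    comparisons: Iterable[Tuple[str, str, str]],
    k: float = 32.0,        # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Process (model_a, model_b, outcome) triples in order and return ratings."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, outcome in comparisons:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = expected_score(r_a, r_b)
        s_a = SCORE_FOR_A[outcome]
        # The two updates are symmetric, so the rating sum is conserved.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return ratings
```

Sorting the returned dictionary by rating gives the global ranking. Sequential Elo updates depend on the order in which comparisons are processed, so in practice one might average over several random orderings.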

Critical Analysis

The paper presents a compelling and well-designed approach for efficiently evaluating large language models using human judgments. The "maximum discrepancy" idea is a clever way to target the most informative examples, reducing the burden on human evaluators.

However, the MAD approach has some potential limitations. For instance, the prompts used in the evaluation may not be representative of real-world usage scenarios, which could limit the generalizability of the findings. Additionally, it is less clear how the approach behaves on highly subjective or open-ended tasks, where the definition of "discrepancy" could be more nuanced.

Further research could investigate the robustness of MAD competition to different prompt sets, task types, and LLM architectures. It would also be valuable to better understand the cognitive processes and biases that might influence human judgments in this setting, and how these factors could be mitigated.

Conclusion

The paper introduces MAximum Discrepancy (MAD) competition, a method for efficiently evaluating large language models (LLMs) using human judgments. By focusing the evaluation on the instructions where two LLMs' responses differ the most, and aggregating the resulting pairwise human preferences with the Elo rating system, the researchers demonstrate that a reliable ranking can be achieved with significantly fewer human assessments than traditional evaluation approaches.

The MAD method represents an important step forward in addressing the scalability challenges of LLM evaluation, which is crucial as these models become increasingly powerful and widespread. Despite some potential limitations, the approach is a promising direction for making human evaluation of LLMs more practical and informative.

Related Papers

Language Models can Evaluate Themselves via Probability Discrepancy

Tingyu Xia, Bowen Yu, Yuan Wu, Yi Chang, Chang Zhou

In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their less skilled counterparts. Expanding on this foundational insight, we propose a new self-evaluation method ProbDiff for assessing the efficacy of various LLMs. This approach obviates the necessity for an additional evaluation model or the dependence on external, proprietary models like GPT-4 for judgment. It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. A higher discrepancy for a given query between two LLMs indicates a relatively weaker capability. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4, spanning a range of scenarios that include natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, and benchmarks for LLM evaluation like AlignBench, MT-Bench, and AlpacaEval, across LLMs of varying magnitudes.

5/20/2024

cs.CL cs.AI

Prediction-Powered Ranking of Large Language Models

Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, Manuel Gomez Rodriguez

Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with the distribution of human pairwise preferences asymptotically. Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectivity of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.

5/24/2024

cs.LG cs.AI cs.CL cs.CY cs.HC stat.ML

Large Language Models are Inconsistent and Biased Evaluators

Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara

The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias, a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low inter-sample agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.

5/6/2024

cs.CL cs.AI

PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data

Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram

Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors -- the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings - pairwise comparison and direct assessment and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.

6/24/2024

cs.CL