Language Models can Evaluate Themselves via Probability Discrepancy

2405.10516

Published 5/20/2024 by Tingyu Xia, Bowen Yu, Yuan Wu, Yi Chang, Chang Zhou

💬

Abstract

In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their less skilled counterparts. Expanding on this foundational insight, we propose a new self-evaluation method ProbDiff for assessing the efficacy of various LLMs. This approach obviates the necessity for an additional evaluation model or the dependence on external, proprietary models like GPT-4 for judgment. It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. A higher discrepancy for a given query between two LLMs indicates a relatively weaker capability. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4, spanning a range of scenarios that include natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, and benchmarks for LLM evaluation like AlignBench, MT-Bench, and AlpacaEval, across LLMs of varying magnitudes.

Create account to get full access

Overview

This paper explores how the probability distribution of responses from Large Language Models (LLMs) can be used to assess their capabilities.
The authors propose a new self-evaluation method called "ProbDiff" that compares the probability discrepancy between an LLM's initial response and its revised versions.
The paper shows that ProbDiff achieves results comparable to evaluations based on the proprietary GPT-4 model across various natural language tasks and benchmarks.

Plain English Explanation

When Large Language Models (LLMs) are asked to respond to queries, more capable models tend to have a more even distribution of probabilities in their answers, compared to their less skilled counterparts. Building on this insight, the researchers developed a new self-evaluation method called ProbDiff that assesses an LLM's efficacy without relying on external, proprietary models like GPT-4.

The ProbDiff approach looks at the probability discrepancy between an LLM's initial response and its revised versions for a given query. A higher discrepancy indicates the LLM has a relatively weaker capability. The researchers found that ProbDiff produces results comparable to evaluations based on GPT-4 across a range of natural language tasks, such as translation, summarization, and blog writing, as well as established benchmarks for LLM evaluation.

Technical Explanation

The paper starts by demonstrating that more capable LLMs exhibit a more even probability distribution in their responses, compared to less skilled models. Building on this observation, the authors propose a new self-evaluation method called ProbDiff.

ProbDiff leverages the LLMs being tested to compute the probability discrepancy between their initial responses and revised versions for a given query. The idea is that a higher discrepancy indicates a relatively weaker capability for that LLM.

The researchers evaluated ProbDiff across a range of natural language generation (NLG) tasks, including translation, summarization, and a novel Xiaohongshu blog writing task. They also tested it on established LLM evaluation benchmarks like AlignBench, MT-Bench, and AlpacaEval.

The results show that ProbDiff achieves performance on par with evaluations based on the proprietary GPT-4 model, despite not relying on any external, commercially unavailable models.

Critical Analysis

The paper provides a novel and interesting approach to assessing LLM capabilities through the lens of probability distributions. The ProbDiff method is an innovative way to perform self-evaluation without the need for external, proprietary models.

However, the authors acknowledge that their approach has some limitations. For example, ProbDiff may not be as effective for models with less diverse response distributions, and it may be influenced by the specific prompts and tasks used in the evaluation.

Additionally, while the paper demonstrates the efficacy of ProbDiff across a range of scenarios, it would be valuable to see further testing on an even broader set of tasks and benchmarks to fully establish the method's robustness and generalizability.

Conclusion

This paper presents a new self-evaluation technique called ProbDiff that assesses the capabilities of Large Language Models by analyzing the probability discrepancy in their responses. The authors show that ProbDiff achieves results comparable to evaluations based on the proprietary GPT-4 model, without relying on external, commercially unavailable resources.

The ProbDiff approach offers a promising alternative for evaluating LLM performance and could help advance the field of natural language processing by providing a more accessible and transparent method for assessing model capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models

Abhishek Kumar, Robert Morabito, Sanzhar Umbet, Jad Kabbara, Ali Emami

As the use of Large Language Models (LLMs) becomes more widespread, understanding their self-evaluation of confidence in generated responses becomes increasingly important as it is integral to the reliability of the output of these models. We introduce the concept of Confidence-Probability Alignment, that connects an LLM's internal confidence, quantified by token probabilities, to the confidence conveyed in the model's response when explicitly asked about its certainty. Using various datasets and prompting techniques that encourage model introspection, we probe the alignment between models' internal and expressed confidence. These techniques encompass using structured evaluation scales to rate confidence, including answer options when prompting, and eliciting the model's confidence level for outputs it does not recognize as its own. Notably, among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment, with an average Spearman's $hat{rho}$ of 0.42, across a wide range of tasks. Our work contributes to the ongoing efforts to facilitate risk assessment in the application of LLMs and to further our understanding of model trustworthiness.

6/18/2024

cs.CL cs.AI cs.LG

💬

Evaluating and Mitigating Linguistic Discrimination in Large Language Models

Guoliang Dong, Haoyu Wang, Jun Sun, Xinyu Wang

By training on text in various languages, large language models (LLMs) typically possess multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages. That is, LLMs are hard to keep the consistency of responses when faced with the same task but depicted in different languages. In this study, we first explore the consistency in the LLMs' outputs responding to queries in various languages from two aspects: safety and quality. We conduct this analysis with two datasets (AdvBench and NQ) based on four LLMs (Llama2-13b, Gemma-7b, GPT-3.5-turbo and Gemini-pro). The results show that LLMs exhibit stronger human alignment capabilities with queries in English, French, Russian, and Spanish (only 1.04% of harmful queries successfully jailbreak on average) compared to queries in Bengali, Georgian, Nepali and Maithili (27.7% of harmful queries jailbreak successfully on average). Moreover, for queries in English, Danish, Czech and Slovenian, LLMs tend to produce responses with a higher quality (with 0.1494 $F_1$ score on average) compared to the other languages. Upon these findings, we propose LDFighter, a similarity-based voting, to mitigate the linguistic discrimination in LLMs. LDFighter ensures consistent service for different language speakers. We evaluate LDFighter with both benign queries and harmful queries. The results show that LDFighter not only significantly reduces the jailbreak success rate but also improve the response quality on average, demonstrating its effectiveness.

5/13/2024

cs.CL cs.AI cs.CR cs.SE

Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

Kehua Feng, Keyan Ding, Kede Ma, Zhihua Wang, Qiang Zhang, Huajun Chen

The past years have witnessed a proliferation of large language models (LLMs). Yet, automated and unbiased evaluation of LLMs is challenging due to the inaccuracy of standard metrics in reflecting human preferences and the inefficiency in sampling informative and diverse test examples. While human evaluation remains the gold standard, it is expensive and time-consuming, especially when dealing with a large number of testing samples. To address this problem, we propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition. MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs, whose responses are subject to three-alternative forced choice by human subjects. The pairwise comparison results are then aggregated into a global ranking using the Elo rating system. We select eight representative LLMs and compare them in terms of four skills: knowledge understanding, mathematical reasoning, writing, and coding. Experimental results show that the proposed method achieves a reliable and sensible ranking of LLMs' capabilities, identifies their relative strengths and weaknesses, and offers valuable insights for further LLM advancement.

4/15/2024

cs.LG cs.CL cs.HC

LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery, Samuel R. Bowman, Shi Feng

Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.

4/23/2024

cs.CL cs.AI