Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

Read original: arXiv:2407.12847 - Published 7/19/2024 by Roland Daynauth, Jason Mars

Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

Overview

This paper explores the issue of token count bias in language model assessments and proposes methods to align model evaluations with human preferences.
The researchers found that current language model evaluation metrics can be biased towards longer outputs, leading to models that prioritize verbosity over quality.
They present approaches to mitigate this bias, including Fairer Preferences Elicit Improved Human-Aligned Large and Aligning Language Models with Human Preferences, to produce evaluations that better reflect human values.

Plain English Explanation

When evaluating language models, current methods often reward models that generate longer outputs, even if those outputs aren't necessarily better or more relevant. This can lead to models that prioritize verbosity over producing high-quality, concise responses that align with human preferences.

The researchers in this paper aimed to address this "token count bias" by developing new evaluation approaches that better capture what humans actually value in language model outputs. One key idea is to use Fairer Preferences Elicit Improved Human-Aligned Large to gather preference data from humans on model outputs, rather than just relying on automated metrics.

Another approach, Aligning Language Models with Human Preferences, involves training the language model to optimize directly for human-aligned objectives, rather than just maximizing metrics like perplexity. This helps ensure the model's outputs are tailored to what people actually find useful and desirable.

By addressing this token count bias, the researchers aim to produce language model evaluations that are more reflective of real-world human preferences and values. This could lead to the development of more helpful and impactful language AI systems.

Technical Explanation

The paper first identifies the issue of token count bias in current language model evaluation metrics. Existing approaches like perplexity and BLEU score tend to favor models that generate longer outputs, even if those outputs are not necessarily of higher quality or more relevant to the task.

To address this, the researchers propose two key methods. The first is Fairer Preferences Elicit Improved Human-Aligned Large, which gathers direct preference data from humans on model outputs. This allows the evaluation to more directly capture what people value, rather than relying solely on automated metrics.

The second approach is Aligning Language Models with Human Preferences, where the language model is trained to optimize for human-aligned objectives, rather than just maximizing metrics like perplexity. This helps ensure the model's outputs are tailored to what people actually find useful and desirable.

Through a series of experiments, the researchers demonstrate that these methods can effectively mitigate token count bias and produce evaluations that better align with human preferences. This includes showing how the Prediction Powered Ranking approach can be used to account for token count when assessing model performance.

Overall, the paper provides valuable insights and practical tools for developing language models that are more closely aligned with human values and needs, moving beyond the limitations of current evaluation practices highlighted in Large Language Models Are Inconsistent, Biased Evaluators.

Critical Analysis

The paper makes a compelling case for the need to address token count bias in language model assessments. The proposed methods, such as Fairer Preferences Elicit Improved Human-Aligned Large and Aligning Language Models with Human Preferences, provide promising avenues for producing evaluations that better reflect human values and preferences.

However, the paper does not fully explore the potential limitations or challenges of implementing these approaches in real-world settings. For example, gathering high-quality human preference data can be time-consuming and resource-intensive, and there may be biases or inconsistencies in how individuals evaluate language model outputs.

Additionally, the paper does not address the broader issue of the METAL: Towards Multilingual Meta-Evaluation of language models, which could be an important consideration for deploying these techniques across diverse languages and cultural contexts.

Overall, the research presented in this paper represents an important step towards aligning language model evaluations with human preferences, but further work may be needed to fully address the complexities and challenges involved in this endeavor.

Conclusion

This paper makes a significant contribution to the field of language model evaluation by identifying and proposing solutions to the issue of token count bias. By developing approaches like Fairer Preferences Elicit Improved Human-Aligned Large and Aligning Language Models with Human Preferences, the researchers aim to produce evaluations that better reflect human values and preferences.

These methods have the potential to drive the development of language AI systems that are more useful, relevant, and aligned with the needs and expectations of end-users. As the field of natural language processing continues to advance, this work highlights the importance of ensuring that technological progress is accompanied by a keen awareness of human-centered concerns and priorities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

Roland Daynauth, Jason Mars

The SLAM paper demonstrated that on-device Small Language Models (SLMs) are a viable and cost-effective alternative to API-based Large Language Models (LLMs), such as OpenAI's GPT-4, offering comparable performance and stability. However, SLAM also identified discrepancies between human preferences and traditional auto-evaluators. This follow-up paper explores methods to align LLM evaluator preferences with human evaluations by addressing biases, particularly toward higher token counts. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases. For instance, spearman's ranking correlation score in the Recommendation use case improved from -27.27 to 44.55. These results highlight the importance of accounting for biases in automated evaluations to ensure fair and accurate model assessments. The recalibration process enhances the reliability of automated evaluators, leading to better AI models that align with human values and expectations. This study provides a robust methodology for future research into bias correction and emphasizes the feasibility and benefits of developing human-aligned AI evaluation systems.

7/19/2024

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vuli'c, Anna Korhonen, Nigel Collier

Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human judgement, revealing that existing calibration methods aimed at mitigating biases are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts. PairS achieves state-of-the-art performance on representative evaluation tasks and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PairS benefits from calibration.

8/13/2024

Large Language Models are Inconsistent and Biased Evaluators

Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara

The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias-a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low inter-sample agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.

5/6/2024

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

Shachi H Kumar, Saurav Sahay, Sahisnu Mazumder, Eda Okur, Ramesh Manuvinakurike, Nicole Beckage, Hsuan Su, Hung-yi Lee, Lama Nachman

Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks where malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards as well as consensus and existing methods often rely on human-generated templates and annotations which are expensive and labor intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM- based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.

8/9/2024