Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments

Read original: arXiv:2406.11370 - Published 6/18/2024 by Han Zhou, Xingchen Wan, Yinhong Liu, Nigel Collier, Ivan Vulić, Anna Korhonen

Overview

This research paper examines how the preference decisions that large language models (LLMs) make when acting as pairwise evaluators affect how well their judgments align with human judgments.
The study finds that "fairer" predictive preferences, meaning preference decisions that are less skewed toward one option, consistently lead to judgments that are better aligned with humans.
The authors propose ZEPO, an automatic zero-shot prompt optimization framework that uses preference fairness as its objective, with implications for the reliability of LLM-based evaluation.

Plain English Explanation

Large language models (LLMs) such as GPT-3 can generate human-like text on a wide range of topics, and they are increasingly used as low-cost evaluators that compare two generated texts and pick the better one. However, these evaluators exhibit preference biases and are sensitive to how the prompt is worded, which raises questions about how well their judgments match those of humans.

This study investigated how an LLM evaluator's preferences change with the instructions it is given. The researchers found that when the evaluator's preference decisions were "fairer", its judgments agreed more closely with human judgments.

In contrast, the researchers show that an LLM's predictive preferences can be highly brittle and skewed even when the instructions are semantically equivalent, and such skewed preferences go along with weaker human alignment. This suggests that prompt design has a significant impact on the quality of LLM judgments.

By optimizing prompts so that the evaluator's preference decisions are fairer, we may be able to build LLM evaluators whose judgments are more reliable and closer to human judgments. This matters because pairwise LLM evaluators are already employed in a wide range of applications.

Technical Explanation

The researchers study pairwise LLM evaluators, which compare two generated texts and determine the preferred one. They first show that the predictive preference of an LLM can be highly brittle and skewed, even across semantically equivalent instructions.

Motivated by this observation, they propose ZEPO (Zero-shot Evaluation-oriented Prompt Optimization), an automatic framework that optimizes the evaluation prompt using a zero-shot learning objective based on preference decision fairness. The objective requires no labeled data, and the framework is evaluated on representative meta-evaluation benchmarks.

The results show that fairer predictive preferences consistently lead to judgments that are better aligned with humans. ZEPO yields substantial performance improvements over state-of-the-art LLM evaluators on these benchmarks.
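
One way to make "preference fairness" concrete is to measure how evenly a pairwise evaluator splits its decisions between the two options. The sketch below is a minimal illustration under that assumption, not the paper's exact objective; the function name `preference_fairness` and the "A"/"B" decision labels are hypothetical.

```python
from collections import Counter


def preference_fairness(decisions):
    """Score how balanced a list of pairwise decisions is.

    `decisions` holds "A" or "B" for each compared pair (hypothetical labels).
    Returns a value in [0, 1]: 1.0 when both options are chosen equally often,
    0.0 when the evaluator always picks the same option.
    """
    if not decisions:
        raise ValueError("decisions must be non-empty")
    counts = Counter(decisions)
    p_a = counts["A"] / len(decisions)  # fraction of pairs where "A" was preferred
    return 1.0 - 2.0 * abs(p_a - 0.5)   # distance from a 50/50 split, rescaled


# An evaluator that prefers "A" in three of four comparisons is half as fair
# as a perfectly balanced one under this score.
print(preference_fairness(["A", "A", "A", "B"]))  # 0.5
```

A score like this needs no human labels, which is what makes it usable as a zero-shot signal.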

These findings have implications for how LLM evaluators are designed. The authors position ZEPO as an efficient prompt optimizer for narrowing the gap between LLM evaluators and human judgments.
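
The paper's abstract describes a prompt optimization framework, ZEPO, driven by a fairness objective that needs no labeled data. The following sketch shows the general idea in a heavily simplified form: score each candidate prompt by how balanced the evaluator's decisions are on unlabeled pairs, then keep the best one. It is a stand-in for illustration, not the ZEPO procedure itself; `select_fairest_prompt`, `toy_judge`, the prompts, and the text pairs are all hypothetical, and a real setup would call an LLM where `toy_judge` is used.

```python
def select_fairest_prompt(candidate_prompts, pairs, judge):
    """Return the prompt whose decisions over unlabeled pairs are most balanced.

    `judge(prompt, text_a, text_b)` must return "A" or "B" (assumed interface).
    """
    def fairness(prompt):
        decisions = [judge(prompt, a, b) for a, b in pairs]
        p_a = sum(d == "A" for d in decisions) / len(decisions)
        return 1.0 - 2.0 * abs(p_a - 0.5)  # 1.0 means a 50/50 split

    return max(candidate_prompts, key=fairness)


def toy_judge(prompt, text_a, text_b):
    """Stand-in for an LLM call: one prompt triggers a fixed bias toward "A"."""
    if prompt == "Which is better?":
        return "A"  # skewed: always prefers the first option
    return "A" if len(text_a) >= len(text_b) else "B"


prompts = ["Which is better?", "Compare the two and pick the stronger one."]
pairs = [("aa", "b"), ("a", "bb"), ("ccc", "d"), ("e", "fff")]

best = select_fairest_prompt(prompts, pairs, toy_judge)
print(best)  # Compare the two and pick the stronger one.
```

Choosing among a fixed list keeps the example short; generating and refining candidate prompts automatically is outside the scope of this sketch.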

Critical Analysis

The researchers acknowledge several limitations and areas for further research in this study. For example, they note that the preferences used in the experiments, while designed to be more representative, may still not fully capture the diversity of human values and viewpoints.

Additionally, the study evaluates its method on representative meta-evaluation benchmarks, so it is unclear how the findings would translate to other real-world evaluation settings. Aligning LLM evaluators with human preferences remains a hard problem that will likely require further research and refinement.

It is also worth considering the risks of relying too heavily on LLM-generated preferences. Related work finds that large language models are inconsistent and biased evaluators, so the preferences they produce may not always be representative of human judgments.

Overall, this study provides important insights into the role of preferences in shaping LLM judgments and outputs, but there is still much work to be done to ensure these powerful AI systems are truly aligned with human values and interests.

Conclusion

This research paper highlights the close link between the fairness of an LLM evaluator's preference decisions and its alignment with human judgments. By optimizing prompts for fairer preference decisions, the study found that LLM evaluators can produce judgments that agree better with humans.

These findings have implications for the design of LLM evaluators and of prompt optimization methods. As LLM-based evaluation becomes more widely used, it will be crucial to ensure that its judgments are fair, consistent, and well aligned with human judgments.

While this study is a valuable step forward, much work remains to ensure the fairness and consistency of LLM evaluations. Continued research in this area will be important for the responsible development and deployment of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments

Han Zhou, Xingchen Wan, Yinhong Liu, Nigel Collier, Ivan Vulić, Anna Korhonen

Large language models (LLMs) have shown promising abilities as cost-effective and reference-free evaluators for assessing language generation quality. In particular, pairwise LLM evaluators, which compare two generated texts and determine the preferred one, have been employed in a wide range of applications. However, LLMs exhibit preference biases and worrying sensitivity to prompt designs. In this work, we first reveal that the predictive preference of LLMs can be highly brittle and skewed, even with semantically equivalent instructions. We find that fairer predictive preferences from LLMs consistently lead to judgments that are better aligned with humans. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO, which aims to produce fairer preference decisions and improve the alignment of LLM evaluators with human judgments. To this end, we propose a zero-shot learning objective based on the preference decision fairness. ZEPO demonstrates substantial performance improvements over state-of-the-art LLM evaluators, without requiring labeled data, on representative meta-evaluation benchmarks. Our findings underscore the critical correlation between preference fairness and human alignment, positioning ZEPO as an efficient prompt optimizer for bridging the gap between LLM evaluators and human judgments.

6/18/2024

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, Nigel Collier

Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human judgement, revealing that existing calibration methods aimed at mitigating biases are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts. PairS achieves state-of-the-art performance on representative evaluation tasks and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PairS benefits from calibration.

8/13/2024

Large Language Models are Inconsistent and Biased Evaluators

Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara

The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias, a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low inter-sample agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.

5/6/2024

Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

Roland Daynauth, Jason Mars

The SLAM paper demonstrated that on-device Small Language Models (SLMs) are a viable and cost-effective alternative to API-based Large Language Models (LLMs), such as OpenAI's GPT-4, offering comparable performance and stability. However, SLAM also identified discrepancies between human preferences and traditional auto-evaluators. This follow-up paper explores methods to align LLM evaluator preferences with human evaluations by addressing biases, particularly toward higher token counts. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases. For instance, Spearman's ranking correlation score in the Recommendation use case improved from -27.27 to 44.55. These results highlight the importance of accounting for biases in automated evaluations to ensure fair and accurate model assessments. The recalibration process enhances the reliability of automated evaluators, leading to better AI models that align with human values and expectations. This study provides a robust methodology for future research into bias correction and emphasizes the feasibility and benefits of developing human-aligned AI evaluation systems.

7/19/2024