Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs

arXiv:2406.07791

Published 6/13/2024 by Lin Shi, Weicheng Ma, Soroush Vosoughi

Abstract

LLM-as-a-Judge offers a promising alternative to human judges across various tasks, yet inherent biases, particularly position bias - a systematic preference for answers based on their position in the prompt - compromise its effectiveness. Our study investigates this issue by developing a framework to systematically study and quantify position bias using metrics such as repetitional consistency, positional consistency, and positional fairness. We conduct experiments with 9 judge models across 22 tasks from the MTBench and DevBench benchmarks and nearly 40 answer-generating models, generating approximately 80,000 evaluation instances. This comprehensive assessment reveals significant variations in bias across judges and tasks. Although GPT-4 often excels in positional consistency and fairness, some more cost-effective models perform comparably or even better in specific tasks, highlighting essential trade-offs between consistency, fairness, and cost. Our results also demonstrate high consistency of judgment across repetitions, confirming that position bias is not due to random variations. This research significantly contributes to the field by introducing new concepts for understanding position bias and providing a multi-dimensional framework for evaluation. These insights guide the selection of optimal judge models, enhance benchmark design, and lay the foundation for future research into effective debiasing strategies, ultimately enhancing the reliability of LLM evaluators.

Overview

Investigates position bias in pairwise comparative assessments by large language models (LLMs)
Explores how the placement of items being compared can influence the judgments of LLMs
Provides insights into the reliability and fairness of using LLMs as judges in various applications

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can be used to evaluate and compare different items or options. However, the researchers behind the paper "Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs" found that the position of the items being compared can significantly influence the judgments of these models.

The researchers call this "position bias": a systematic preference for an answer based on where it appears in the prompt rather than on its quality. According to the abstract, the strength of this bias varies significantly across judge models and tasks. It could lead to unfair or inaccurate assessments, especially in applications where LLMs are used to judge or evaluate things like product reviews, creative writing, or spatial relationships.

To understand this position bias, the researchers ran experiments in which judge LLMs were shown pairs of answers and asked to choose the better one. By systematically varying the order of the two answers, they could measure how often a verdict depended on position rather than content.

Technical Explanation

The researchers designed a framework to study and quantify position bias in LLMs' pairwise comparative assessments, using three metrics: repetitional consistency, positional consistency, and positional fairness. According to the abstract, the experiments cover 9 judge models, 22 tasks drawn from the MTBench and DevBench benchmarks, and answers from nearly 40 answer-generating models, for approximately 80,000 evaluation instances.

In each experiment, the judge was presented with a pair of answers and asked to select the better one. By systematically varying which answer appeared first and which appeared second in the prompt, the researchers could measure the extent to which judgments were influenced by position.
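The order-swapping procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Judge` signature, the verdict labels `"first"`, `"second"`, and `"tie"`, the two toy judges, and the exact definition of the consistency rate are all assumptions made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def winner(verdict: str, first: str, second: str) -> Optional[str]:
    """Map a positional verdict back to the answer text it refers to (None for a tie)."""
    return {"first": first, "second": second}.get(verdict)


def positional_consistency(judge: Judge, pairs: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (question, a, b) pairs where the same answer wins in both orders.

    Assumes the two answers in a pair are distinct strings.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for question, a, b in pairs:
        original = winner(judge(question, a, b), a, b)  # a shown first
        swapped = winner(judge(question, b, a), b, a)   # b shown first
        if original == swapped:
            consistent += 1
    return consistent / len(pairs)


# Toy judges for illustration only.
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # maximally position-biased


def longer_wins(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"  # ignores position


example_pairs = [("q1", "short", "a longer answer"), ("q2", "xx", "y")]
```

With these toy judges, `positional_consistency(always_first, example_pairs)` is 0.0 because the winner flips on every swap, while `positional_consistency(longer_wins, example_pairs)` is 1.0 because the verdict depends only on content.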

The results showed that position bias is real but uneven: according to the abstract, its strength varies significantly across judges and tasks. GPT-4 often excels in positional consistency and fairness, yet some more cost-effective models perform comparably or even better on specific tasks. Judgments were also highly consistent across repetitions, which indicates that the bias is systematic rather than a product of random variation.
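To see which slot a judge favors, one can inspect the pair of verdicts obtained from the original and the swapped ordering of the same two answers. The sketch below is an illustrative score only: the function name, the labels, and the formula are assumptions for this example and should not be read as the paper's positional fairness metric.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def position_preference(verdict_pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Summarize which slot a judge favors.

    Each element is (verdict_in_original_order, verdict_in_swapped_order),
    with verdicts "first", "second", or "tie" naming the winning slot.
    Choosing "first" in both orders means two different answers won, so the
    choice followed the slot (primacy). "second" in both orders is recency.
    """
    counts = Counter()
    total = 0
    for original, swapped in verdict_pairs:
        total += 1
        if original == swapped == "first":
            counts["primacy"] += 1
        elif original == swapped == "second":
            counts["recency"] += 1
        else:
            counts["other"] += 1  # consistent winner, or a tie in either order
    if total == 0:
        raise ValueError("need at least one verdict pair")
    return {
        "primacy_rate": counts["primacy"] / total,
        "recency_rate": counts["recency"] / total,
        # Illustrative score in [-1, 1]: negative leans first, positive leans second.
        "preference_score": (counts["recency"] - counts["primacy"]) / total,
    }


summary = position_preference(
    [("first", "first"), ("first", "first"), ("second", "second"), ("first", "second")]
)
```

In this toy input, two of four pairs follow the first slot and one follows the second, so the judge leans toward the first position overall.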

The researchers also explored potential factors that might contribute to this position bias, such as the LLMs' tendency to process information sequentially and the inherent biases in the training data used to develop these models.
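Before attributing an order effect to any such factor, it has to be separated from sampling noise. The abstract names a metric for this, repetitional consistency, without giving its formula. A minimal sketch under one assumed definition (the average share of repeated runs that agree with the majority verdict for the same prompt) could look like this:

```python
from collections import Counter
from typing import List, Sequence


def repetitional_consistency(verdicts_per_prompt: Sequence[Sequence[str]]) -> float:
    """Average, over prompts, of the share of repeated runs matching the majority verdict.

    Each inner sequence holds the verdicts from repeating one identical prompt.
    This definition is an assumption made for illustration.
    """
    if not verdicts_per_prompt:
        raise ValueError("need at least one prompt")
    rates: List[float] = []
    for verdicts in verdicts_per_prompt:
        if not verdicts:
            raise ValueError("each prompt needs at least one verdict")
        majority_count = Counter(verdicts).most_common(1)[0][1]
        rates.append(majority_count / len(verdicts))
    return sum(rates) / len(rates)


score = repetitional_consistency(
    [
        ["first", "first", "first"],            # fully stable
        ["first", "second", "first", "first"],  # one of four runs disagrees
    ]
)
```

A score near 1.0 means repeated runs rarely disagree, so any verdict that flips when the answers are swapped is unlikely to be explained by randomness alone.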

Critical Analysis

The researchers acknowledge several caveats and limitations in their study. For example, they note that the magnitude of the position bias may vary depending on the specific LLM architecture, task, and dataset used. Additionally, they suggest that further research is needed to investigate the underlying mechanisms driving the position bias and explore potential mitigation strategies.

One potential concern is that the position bias observed in this study could have significant implications for the reliability and fairness of using LLMs as judges or evaluators in real-world applications, such as product reviews, creative writing, or spatial relationships. If the LLMs' judgments are systematically biased by the position of the items being compared, it could lead to unfair or inaccurate assessments, with potentially significant consequences for individuals or organizations relying on these judgments.

It is also worth considering whether the position bias observed in this study is unique to LLMs or if it might also be present in human decision-making. Further research comparing the position bias of LLMs and human judges could provide valuable insights into the nature and generalizability of this bias.

Conclusion

The paper "Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs" highlights a concerning position bias in how large language models (LLMs) evaluate and compare different items. This bias could undermine the reliability and fairness of using LLMs as judges or evaluators in various applications, such as product reviews, creative writing, and spatial relationships.

The researchers' findings call for further investigation into the underlying mechanisms driving this position bias and the development of strategies to mitigate it. As LLMs continue to be adopted in high-stakes decision-making contexts, understanding and addressing these biases will be crucial for ensuring the trustworthiness and fairness of these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.

6/19/2024

cs.CL cs.AI

Humans or LLMs as the Judge? A Study on Judgement Biases

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang

Adopting human and large language models (LLM) as judges (a.k.a human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLMs, questioning the reliability of the evaluation results. In this paper, we propose a novel framework that is free from referencing groundtruth annotations for investigating Misinformation Oversight Bias, Gender Bias, Authority Bias and Beauty Bias on LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit these biases to conduct attacks on LLM judges. We hope that our work can notify the community of the bias and vulnerability of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.

7/1/2024

cs.CL

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

6/27/2024

cs.CL

Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons

Adian Liusie, Vatsal Raina, Yassir Fathullah, Mark Gales

LLM-as-a-judge approaches are a practical and effective way of assessing a range of text tasks, aligning with human judgements especially when applied in a comparative assessment fashion. However, when using pairwise comparisons to rank a set of candidates the computational costs scale quadratically with the number of candidates, which can have practical limitations. This paper introduces a Product of Expert (PoE) framework for efficient LLM Comparative Assessment. Here individual comparisons are considered experts that provide information on a pair's score difference. The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates, and is highly flexible where any form of expert can be assumed. When Gaussian experts are used one can derive simple closed-form solutions for the optimal candidate ranking, as well as expressions for selecting which comparisons should be made to maximize the probability of this ranking. Our approach enables efficient comparative assessment, where by using only a small subset of the possible comparisons, one can generate score predictions that correlate as well to human judgements as the predictions when all comparisons are used. We evaluate the approach on multiple NLG tasks and demonstrate that our framework can yield considerable computational savings when performing pairwise comparative assessment. When N is large, with as few as 2% of comparisons the PoE solution can achieve similar performance to when all comparisons are used.

6/11/2024

cs.CL