Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

2406.12624

Published 6/19/2024 by Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

cs.CL cs.AI

Abstract

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
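
The abstract's contrast between percent agreement and Cohen's kappa can be made concrete with a small sketch. The labels below are invented for illustration and are not data from the paper, and `percent_agreement` and `cohens_kappa` are hypothetical helper names. Both toy judges agree with the human labels on 9 of 10 answers, yet kappa, which subtracts the agreement expected by chance, tells them apart.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    p_e = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(count_a) | set(count_b)
    )
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented labels: 1 = answer marked correct, 0 = marked incorrect.
human = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
lenient = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # toy judge that accepts everything
stricter = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # toy judge with one extra rejection

for name, judge in [("lenient", lenient), ("stricter", stricter)]:
    print(name, percent_agreement(human, judge), round(cohens_kappa(human, judge), 3))
```

In this toy case the judge that accepts every answer reaches 90% agreement only because most answers are correct, and kappa scores it as no better than chance.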

Overview

This paper evaluates how well large language models (LLMs) perform as judges of other LLMs' answers, using TriviaQA as an objective question-answering benchmark and human annotations as the reference.
The researchers study 9 judge models and 9 exam taker models, both base and instruction-tuned, and assess judge alignment across model sizes, families, and judge prompts.
The paper argues for measuring alignment with Cohen's kappa rather than simple percent agreement, and examines error patterns, instruction length, and leniency bias.

Plain English Explanation

In this paper, the researchers look at how well large language models (LLMs) can act as judges that grade the answers of other LLMs. Human evaluation is hard to scale, so using an LLM as the grader is rapidly gaining traction. The researchers wanted to know how far these LLM "judges" can be trusted and what biases they might hold.

To keep the comparison clean, they use TriviaQA, a benchmark that tests objective knowledge. The human annotations they compare against show high inter-annotator agreement, which gives a reliable reference for measuring the judges.

The study covers 9 judge models grading 9 "exam taker" models. A key lesson is that simple percent agreement can be misleading: judges that agree with humans most of the time can still assign vastly different scores. Cohen's kappa, which corrects for agreement expected by chance, separates the judges more clearly.

Llama-3 70B and GPT-4 Turbo align very well with humans. When the goal is to rank the exam taker models, however, JudgeLM-7B and a lexical judge called Contains do better, even though their human alignment is up to 34 points lower. The researchers also study instruction length and leniency bias to draw lessons for future use of LLMs as judges.

Technical Explanation

The paper presents a study of LLMs acting as judges in a setting where correctness is objective. Exam taker models answer TriviaQA questions, judge models grade those answers, and the judges' verdicts are compared with human annotations, which the authors report to have high inter-annotator agreement.

The evaluation covers 9 judge models and 9 exam taker models, including both base and instruction-tuned models. Judge alignment is assessed across model sizes, model families, and judge prompts.

A central methodological finding is that Cohen's kappa is a more informative alignment metric than simple percent agreement: judges with high percent agreement can still assign vastly different scores.

Llama-3 70B and GPT-4 Turbo show excellent alignment with humans. For ranking exam taker models, however, both are outperformed by JudgeLM-7B and the lexical judge Contains, whose human alignment is up to 34 points lower.

The paper also reports error analysis and further studies, including the effects of instruction length and leniency bias.

Overall, the results indicate that alignment with human labels and the ability to rank models are distinct properties of a judge, and that the choice of metric affects which judges look reliable.
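
The abstract separates two properties of a judge: alignment with human labels and the ability to rank exam taker models. The following is a minimal sketch of the second property, assuming rank agreement is measured with Spearman's rank correlation. That choice is an assumption made here for illustration, not a detail stated in the text above, and the model names and accuracy scores are invented.

```python
def ranks(scores):
    """Map each model name to its rank (1 = highest score). Ties are not handled."""
    if len(set(scores.values())) != len(scores):
        raise ValueError("this sketch does not handle tied scores")
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(x, y):
    """Spearman's rank correlation for two score dicts over the same models."""
    if x.keys() != y.keys() or len(x) < 2:
        raise ValueError("need the same models in both dicts, at least two")
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((rank_x[name] - rank_y[name]) ** 2 for name in x)
    # Closed form for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented accuracy scores for three hypothetical exam taker models.
human_scores = {"model_a": 0.80, "model_b": 0.60, "model_c": 0.40}
lenient_judge = {"model_a": 0.95, "model_b": 0.90, "model_c": 0.85}  # inflated scores, same order
closer_judge = {"model_a": 0.62, "model_b": 0.79, "model_c": 0.41}   # closer in value for model_c only, top two swapped

print(spearman_rho(human_scores, lenient_judge))
print(spearman_rho(human_scores, closer_judge))
```

The toy lenient judge inflates every score yet preserves the human ordering, which illustrates how a judge can rank models well while scoring individual models far from the human reference.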

Critical Analysis

The paper raises important questions about the LLM-as-a-judge paradigm, which is being adopted quickly as a substitute for human evaluation. Even in a setting with objective answers and high human agreement, the judges differ considerably in how closely they track human labels.

One key lesson is methodological. Because judges with high percent agreement can still assign vastly different scores, studies that report percent agreement alone may overstate how reliable a judge is. Reporting Cohen's kappa gives a more informative picture.

The gap between alignment and ranking also deserves a careful reading. JudgeLM-7B and the lexical judge Contains rank exam taker models better than Llama-3 70B and GPT-4 Turbo despite much lower human alignment, so a judge that is adequate for comparing models may not be adequate for scoring individual answers.

A possible limitation, judging from the abstract, is scope: the study relies on TriviaQA, which tests objective knowledge. The findings may not transfer directly to open-ended or subjective evaluation tasks, where human annotators themselves may agree less often.

Overall, the paper is a timely contribution to the discussion of how LLM judges should be validated before their scores are trusted. Readers are encouraged to think critically about which metric and which judge fit their own evaluation goal.

Conclusion

This paper provides a comprehensive evaluation of LLMs used as judges of other LLMs' answers. Using TriviaQA and human annotations with high inter-annotator agreement, the researchers studied 9 judge models and 9 exam taker models across model sizes, families, and judge prompts.

The findings suggest that Llama-3 70B and GPT-4 Turbo align closely with humans, that percent agreement can hide large differences that Cohen's kappa exposes, and that strong human alignment does not guarantee the best ranking of exam taker models.

Overall, the paper offers practical lessons for researchers and practitioners who rely on LLM judges, including the effects of instruction length and leniency bias.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Can LLM be a Personalized Judge?

Yijiang River Dong, Tiancheng Hu, Nigel Collier

Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing interest in LLM personalization within the research community. However, current works often rely on the LLM-as-a-Judge approach for evaluation without thoroughly examining its validity. In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. Through human evaluation, we find that the LLM-as-a-Personalized-Judge achieves comparable performance to third-party human evaluation and even surpasses human performance on high-certainty samples. Our work indicates that certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for developing more reliable and scalable methods for evaluating LLM personalization.

6/18/2024

cs.CL cs.CY

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/.

6/12/2024

cs.CL cs.AI cs.CV

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

6/27/2024

cs.CL

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis

As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.

5/2/2024

cs.CL cs.AI