LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

2406.18403

Published 6/27/2024 by Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller and 10 others

cs.CL

Abstract

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

Overview

• This paper presents a large-scale empirical study of how well 11 large language models (LLMs), both open-weight and proprietary, replicate human judgments across 20 NLP evaluation datasets.

• The researchers construct a "Judge-Bench" dataset, a comprehensive collection of human judgments for various NLP tasks, to facilitate the comparison between LLMs and human judges.

• The study explores the potential of LLMs to serve as reliable substitutes for human judges, with a focus on assessing their alignment with human-provided judgments and the identification of potential vulnerabilities.

Plain English Explanation

The paper investigates whether large language models (LLMs) can be used instead of human judges to evaluate the performance of natural language processing (NLP) systems. The researchers created a dataset called "Judge-Bench" that contains human judgments for a variety of NLP tasks, such as [link to "https://aimodels.fyi/papers/arxiv/judging-judges-evaluating-alignment-vulnerabilities-llms-as"]evaluating the quality of machine translations[/link] or [link to "https://aimodels.fyi/papers/arxiv/can-llm-be-personalized-judge"]assessing the coherence of generated text[/link].

The researchers then tested how well LLMs could perform these evaluation tasks compared to human judges. They wanted to see if LLMs could reliably replace human judges, and also identify any potential issues or vulnerabilities that might arise when using LLMs for these tasks.

The results of this large-scale study across 20 different NLP tasks provide insights into the feasibility of using LLMs as substitutes for human judges in evaluating NLP systems. This could have important implications for the development and [link to "https://aimodels.fyi/papers/arxiv/replacing-judges-juries-evaluating-llm-generations-panel"]deployment of NLP technologies[/link], as well as the [link to "https://aimodels.fyi/papers/arxiv/effectiveness-llms-as-annotators-comparative-overview-empirical"]use of LLMs as efficient annotators[/link] for various language-related tasks.

Technical Explanation

The paper evaluates how well large language models (LLMs) can stand in for human judges across 20 NLP evaluation tasks. The researchers construct "Judge-Bench," a collection of 20 NLP datasets with human annotations. Related work asks a similar question for [link to "https://aimodels.fyi/papers/arxiv/mllm-as-judge-assessing-multimodal-llm-as"]multimodal tasks[/link].

The study uses this dataset to compare the judgments of 11 current LLMs, covering both open-weight and proprietary models, with the human annotations. For each model and dataset, the researchers measure how closely the model's judgments track the human ones, reported as a correlation with human judgments.
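A comparison between LLM and human judgments can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the choice of Spearman correlation for graded scores and Cohen's kappa for categorical labels is an assumption, as are all function names and the example data.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(human_scores, llm_scores):
    """Spearman rank correlation, for graded judgments such as 1-5 ratings."""
    return pearson(average_ranks(human_scores), average_ranks(llm_scores))


def cohen_kappa(human_labels, llm_labels):
    """Chance-corrected agreement, for categorical judgments.

    Undefined (division by zero) if expected agreement is exactly 1.
    """
    n = len(human_labels)
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n
    human_counts, llm_counts = Counter(human_labels), Counter(llm_labels)
    labels = set(human_labels) | set(llm_labels)
    expected = sum(human_counts[l] * llm_counts[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up example data, for illustration only.
human_ratings = [5, 3, 4, 1, 2]
llm_ratings = [4, 3, 5, 2, 1]
print("Spearman:", spearman(human_ratings, llm_ratings))

human_labels = ["good", "bad", "good", "bad"]
llm_labels = ["good", "good", "good", "bad"]
print("Cohen's kappa:", cohen_kappa(human_labels, llm_labels))
```

Rank correlation suits graded ratings because it only asks whether the model orders items the way humans do, while kappa discounts the agreement two annotators would reach by chance on categorical labels.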

The results show that each LLM's correlation with human judgments varies widely from one dataset to another, so good agreement on one task does not guarantee good agreement on the next. The researchers conclude that LLMs are not yet ready to systematically replace human judges in NLP, and that the outcome depends heavily on the specific task.
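One way to make variation across datasets concrete is to summarize a single model's per-dataset agreement scores by their spread rather than by their average alone. The sketch below assumes the scores have already been computed; the function name, dataset names, and numbers are hypothetical and are not results from the paper.

```python
from statistics import mean, pstdev


def summarize_by_dataset(scores):
    """Summarize one model's agreement with humans across datasets.

    `scores` maps a dataset name to an agreement score (for example a
    correlation with human judgments) for a single LLM judge.
    """
    values = list(scores.values())
    lowest = min(scores, key=scores.get)
    highest = max(scores, key=scores.get)
    return {
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation across datasets
        "range": scores[highest] - scores[lowest],
        "lowest": lowest,
        "highest": highest,
    }


# Hypothetical scores for one judge model, for illustration only.
example_scores = {
    "summarization": 0.71,
    "dialogue": 0.35,
    "translation": 0.62,
    "toxicity": 0.18,
}
print(summarize_by_dataset(example_scores))
```

A model with a respectable mean but a large range is exactly the case where an average score would hide tasks on which the judge is unreliable.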

Critical Analysis

The paper presents a comprehensive and well-designed study that offers valuable insights into the potential of using LLMs as substitutes for human judges in NLP evaluation tasks. The construction of the "Judge-Bench" dataset is a significant contribution, as it provides a standardized and diverse set of tasks for evaluating the performance of LLMs.

One potential limitation of the study is the use of a limited set of LLMs, as the performance of these models may vary depending on their architecture, training, and fine-tuning. Additionally, the paper does not explore the impact of task-specific fine-tuning or the use of ensemble methods, which could potentially improve the performance of LLMs in these tasks.

Furthermore, the paper does not delve deeply into the potential biases or vulnerabilities of LLMs, which could be an important consideration when using these models as substitutes for human judges. The researchers acknowledge this limitation and suggest that further research is needed to fully understand the implications of using LLMs in high-stakes evaluation tasks.

Conclusion

This paper presents a large-scale empirical study that evaluates the performance of LLMs in replacing human judges across 20 NLP evaluation tasks. The researchers construct a comprehensive "Judge-Bench" dataset and use it to compare the judgments made by LLMs with those provided by human judges.

The results show large variance across datasets in how well each LLM's judgments correlate with human judgments. The authors therefore conclude that LLMs are not yet ready to systematically replace human judges in NLP, and the study highlights the need to consider the specific task before relying on an LLM for evaluation.

Overall, this research provides valuable insights into the potential and limitations of using LLMs as substitutes for human judges in NLP evaluation, with important implications for the development and deployment of NLP technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Humans or LLMs as the Judge? A Study on Judgement Biases

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang

Adopting human and large language models (LLM) as judges (a.k.a human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLMs, questioning the reliability of the evaluation results. In this paper, we propose a novel framework that is free from referencing groundtruth annotations for investigating Misinformation Oversight Bias, Gender Bias, Authority Bias and Beauty Bias on LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit these biases to conduct attacks on LLM judges. We hope that our work can notify the community of the bias and vulnerability of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.

7/1/2024

cs.CL

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.

6/19/2024

cs.CL cs.AI

Can LLM be a Personalized Judge?

Yijiang River Dong, Tiancheng Hu, Nigel Collier

Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing interest in LLM personalization within the research community. However, current works often rely on the LLM-as-a-Judge approach for evaluation without thoroughly examining its validity. In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. Through human evaluation, we find that the LLM-as-a-Personalized-Judge achieves comparable performance to third-party humans evaluation and even surpasses human performance on high-certainty samples. Our work indicates that certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for developing more reliable and scalable methods for evaluating LLM personalization.

6/18/2024

cs.CL cs.CY

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis

As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.

5/2/2024

cs.CL cs.AI