PRE: A Peer Review Based Large Language Model Evaluator

arXiv:2401.15641

Published 6/4/2024 by Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu

Abstract

The impressive performance of large language models (LLMs) has attracted considerable attention from the academic and industrial communities. Besides how to construct and train LLMs, how to effectively evaluate and compare the capacity of LLMs has also been well recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs on different tasks. However, these paradigms often suffer from high cost, low generalizability, and inherited biases in practice, which make them incapable of supporting the sustainable development of LLMs in the long term. In order to address these issues, inspired by the peer review systems widely used in the academic publication process, we propose a novel framework that can automatically evaluate LLMs through a peer-review process. Specifically, for the evaluation of a specific task, we first construct a small qualification exam to select reviewers from a couple of powerful LLMs. Then, to actually evaluate the submissions written by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to rate or compare the submissions. The final ranking of evaluatee LLMs is generated based on the results provided by all reviewers. We conducted extensive experiments on text summarization tasks with eleven LLMs including GPT-4. The results demonstrate the existence of bias when evaluating with a single LLM. Also, our PRE model outperforms all the baselines, illustrating the effectiveness of the peer review mechanism.

Overview

Large language models (LLMs) have gained significant attention for their impressive performance on various tasks.
Effectively evaluating and comparing the capabilities of LLMs is recognized as an important but challenging problem.
Existing evaluation approaches often suffer from high costs, low generalizability, and inherited biases.
To address these issues, the researchers propose a novel framework that automatically evaluates LLMs through a peer-review process.

Plain English Explanation

The researchers have developed a new way to evaluate and compare the abilities of large language models (LLMs). LLMs are AI systems that can understand and generate human-like text, and they have shown impressive capabilities in various tasks. However, it's been challenging to assess and compare the performance of different LLMs in a reliable and scalable way.

The researchers were inspired by the peer-review process used in academic publishing. Instead of relying on human judges or a single LLM judge, they created a system in which a group of LLMs act as "reviewers" that assess the submissions of other LLMs. The key steps are:

Select Reviewer LLMs: The researchers first construct a small "qualification exam" to identify a group of powerful LLMs that can serve as reviewers.
Evaluate Submissions: The reviewer LLMs then assess the submissions (e.g., text summaries) generated by the LLMs being evaluated, known as the "evaluatees."
Determine Final Rankings: The final rankings of the evaluatee LLMs are based on the combined results provided by all the reviewer LLMs.

The researchers tested this approach on text summarization tasks involving 11 different LLMs, including the powerful GPT-4 model. The results showed that using a single LLM for evaluation can introduce biases, but the peer-review approach (called the "PRE model") was able to outperform other evaluation methods, demonstrating the effectiveness of the peer-review mechanism.

Technical Explanation

The researchers propose a novel peer-review framework to automatically evaluate and compare the capabilities of large language models (LLMs). This approach aims to address the limitations of existing evaluation paradigms, which often rely on human annotators or model-based evaluators and suffer from high costs, low generalizability, and inherited biases.

The key steps of the proposed framework are:

Reviewer Selection: The researchers construct a small "qualification exam" to select a group of powerful LLMs that will serve as reviewers. These reviewer LLMs are expected to have the necessary capabilities to evaluate the submissions.
Submission Evaluation: The reviewer LLMs are then used to rate or compare the submissions (e.g., text summaries) generated by the LLMs being evaluated, known as the "evaluatees."
Final Ranking: The final rankings of the evaluatee LLMs are determined based on the combined results provided by all the reviewer LLMs.
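
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the pass threshold of 0.6, the choice to weight each reviewer by its qualification-exam accuracy, and all model and function names are hypothetical, and the summary does not specify how the paper actually combines reviewer results.

```python
def qualify_reviewers(exam_accuracy, threshold=0.6):
    """Keep candidates whose qualification-exam accuracy meets the threshold.

    The threshold value is an assumption made for this sketch.
    """
    return {name: acc for name, acc in exam_accuracy.items() if acc >= threshold}


def aggregate_scores(ratings, reviewer_weights):
    """Weighted mean of the qualified reviewers' ratings for each evaluatee.

    Weighting by exam accuracy is an assumed aggregation rule.
    """
    scores = {}
    for evaluatee, by_reviewer in ratings.items():
        # Ignore ratings from reviewers that did not pass the exam.
        weighted = [(reviewer_weights[r], s)
                    for r, s in by_reviewer.items() if r in reviewer_weights]
        weight_sum = sum(w for w, _ in weighted)
        if weight_sum > 0:
            scores[evaluatee] = sum(w * s for w, s in weighted) / weight_sum
    return scores


def rank(scores):
    """Order evaluatees from highest to lowest aggregated score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical exam results and 1-5 ratings of each evaluatee's summary.
exam_accuracy = {"reviewer_a": 0.9, "reviewer_b": 0.7, "reviewer_c": 0.4}
ratings = {
    "model_x": {"reviewer_a": 4, "reviewer_b": 3, "reviewer_c": 1},
    "model_y": {"reviewer_a": 2, "reviewer_b": 5, "reviewer_c": 5},
}

reviewers = qualify_reviewers(exam_accuracy)   # reviewer_c is filtered out
final_scores = aggregate_scores(ratings, reviewers)
print(rank(final_scores))
```

In this toy data, the unqualified reviewer_c would have pushed model_y ahead; filtering it out and weighting the remaining reviewers changes the outcome, which is the kind of effect the qualification step is meant to have.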

The researchers conducted extensive experiments on text summarization tasks involving 11 LLMs, including GPT-4. The results showed that evaluation by a single LLM is biased. In contrast, the proposed PRE (peer-review-based evaluator) model outperformed all baseline methods, illustrating the effectiveness of the peer-review mechanism.

Critical Analysis

The researchers have presented an innovative approach to evaluating and comparing the capabilities of large language models (LLMs) using a peer-review framework. This method targets the shortcomings of existing evaluation paradigms: high costs, low generalizability, and inherited biases.

One potential limitation of the proposed approach is the reliance on the selection of appropriate reviewer LLMs. The researchers acknowledge that the "qualification exam" used to identify the reviewer LLMs is a critical step, and the performance of the overall system may be sensitive to the choice of reviewers. Further research could explore more robust and automated methods for reviewer selection.

Additionally, the researchers focused their experiments on text summarization tasks, and it would be valuable to assess the performance of the peer-review framework on a wider range of tasks and datasets to understand its broader applicability and generalizability. Expanding the evaluation to other tasks could provide additional insights into the strengths and weaknesses of the proposed approach.

Another area for further exploration is the potential for biases and inconsistencies in the evaluations provided by the reviewer LLMs. While the researchers argue that the peer-review process can mitigate biases, it would be beneficial to investigate the extent to which the reviewer LLMs themselves may introduce subtle biases or inconsistencies in their assessments.
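
One simple way to start probing such inconsistencies is to measure how often pairs of reviewers agree on the same comparison. The sketch below is an illustration only: the agreement measure, the data layout, and every name in it are assumptions made here, not something described in the paper.

```python
from itertools import combinations


def pairwise_agreement(judgments):
    """For each pair of reviewers, return the fraction of shared
    comparisons on which both picked the same winner."""
    agreement = {}
    for a, b in combinations(sorted(judgments), 2):
        shared = judgments[a].keys() & judgments[b].keys()
        if not shared:
            continue  # no common comparisons, so agreement is undefined
        same = sum(judgments[a][pair] == judgments[b][pair] for pair in shared)
        agreement[(a, b)] = same / len(shared)
    return agreement


# Hypothetical preferences: each key is a pair of evaluatees,
# each value is the evaluatee that the reviewer preferred.
judgments = {
    "reviewer_a": {("x", "y"): "x", ("x", "z"): "x", ("y", "z"): "z"},
    "reviewer_b": {("x", "y"): "x", ("x", "z"): "z", ("y", "z"): "z"},
    "reviewer_c": {("x", "y"): "y", ("x", "z"): "z", ("y", "z"): "z"},
}

for reviewer_pair, rate in pairwise_agreement(judgments).items():
    print(reviewer_pair, round(rate, 2))
```

Low agreement between two reviewers does not say which one is wrong, but it flags the comparisons that are worth checking against human judgments.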

Overall, the researchers have made an important contribution by proposing a novel framework that leverages the power of peer-review to effectively evaluate and compare the capabilities of large language models. This approach holds promise for supporting the sustainable development of LLMs and addressing the challenges inherent in existing evaluation methods.

Conclusion

The researchers have developed a novel peer-review framework to automatically evaluate and compare the capabilities of large language models (LLMs). This approach addresses the limitations of existing evaluation paradigms, which often suffer from high costs, low generalizability, and inherited biases.

By leveraging a group of powerful LLMs as "reviewers" to assess the submissions of other LLMs, the proposed framework demonstrates the ability to outperform traditional evaluation methods. The results of the experiments on text summarization tasks highlight the effectiveness of the peer-review mechanism in mitigating biases that can arise when using a single LLM for evaluation.

This framework could advance LLM development by providing a more reliable and scalable way to assess and compare these systems. As research on LLM-based evaluation continues to evolve, the peer-review approach presented in this work could help support the sustainable and responsible progress of LLM technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

METAL: Towards Multilingual Meta-Evaluation

Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram

With the rising human-like precision of Large Language Models (LLMs) in numerous tasks, their utilization in a variety of real-world applications is becoming more prevalent. Several studies have shown that LLMs excel on many standard NLP benchmarks. However, it is challenging to evaluate LLMs due to test dataset contamination and the limitations of traditional metrics. Since human evaluations are difficult to collect, there is a growing interest in the community to use LLMs themselves as reference-free evaluators for subjective metrics. However, past work has shown that LLM-based evaluators can exhibit bias and have poor alignment with human judgments. In this study, we propose a framework for an end-to-end assessment of LLMs as evaluators in multilingual scenarios. We create a carefully curated dataset, covering 10 languages containing native speaker judgments for the task of summarization. This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL). We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2. Our results indicate that LLM-based evaluators based on GPT-4 perform the best across languages, while GPT-3.5-Turbo performs poorly. Additionally, we perform an analysis of the reasoning provided by LLM-based evaluators and find that it often does not match the reasoning provided by human judges.

4/3/2024

cs.CL

Evaluating the Performance of Large Language Models via Debates

Behrad Moniri, Hamed Hassani, Edgar Dobriban

Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications where tasks are not always from a single domain, or rely on human input, making them unscalable. We propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as problem definition and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.

6/18/2024

cs.CL cs.AI cs.LG

Large Language Models as Partners in Student Essay Evaluation

Toru Ishida, Tongxi Liu, Hailong Wang, William K. Cheung

As the importance of comprehensive evaluation in workshop courses increases, there is a growing demand for efficient and fair assessment methods that reduce the workload for faculty members. This paper presents an evaluation conducted with Large Language Models (LLMs) using actual student essays in three scenarios: 1) without providing guidance such as rubrics, 2) with pre-specified rubrics, and 3) through pairwise comparison of essays. Quantitative analysis of the results revealed a strong correlation between LLM and faculty member assessments in the pairwise comparison scenario with pre-specified rubrics, although concerns about the quality and stability of evaluations remained. Therefore, we conducted a qualitative analysis of LLM assessment comments, showing that: 1) LLMs can match the assessment capabilities of faculty members, 2) variations in LLM assessments should be interpreted as diversity rather than confusion, and 3) assessments by humans and LLMs can differ and complement each other. In conclusion, this paper suggests that LLMs should not be seen merely as assistants to faculty members but as partners in evaluation committees and outlines directions for further research.

5/30/2024

cs.CY cs.AI

What is the best model? Application-driven Evaluation for Large Language Models

Shiguo Lian, Kaikai Zhao, Xinhui Liu, Xuejiao Lei, Bikun Yang, Wenjing Zhang, Kai Wang, Zhaoxiang Liu

General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry as they generalize foundation models to various practical tasks in a prompt manner. To assist users in selecting the best model in practical application scenarios, i.e., choosing the model that meets the application requirements while minimizing cost, we introduce A-Eval, an application-driven LLMs evaluation benchmark for general large language models. First, we categorize evaluation tasks into five main categories and 27 sub-categories from a practical application perspective. Next, we construct a dataset comprising 678 question-and-answer pairs through a process of collecting, annotating, and reviewing. Then, we design an objective and effective evaluation method and evaluate a series of LLMs of different scales on A-Eval. Finally, we reveal interesting laws regarding model scale and task difficulty level and propose a feasible method for selecting the best model. Through A-Eval, we provide clear empirical and engineer guidance for selecting the best model, reducing barriers to selecting and using LLMs and promoting their application and development. Our benchmark is publicly available at https://github.com/UnicomAI/DataSet/tree/main/TestData/GeneralAbility.

6/18/2024

cs.CL cs.AI