CriticEval: Evaluating Large Language Model as Critic

Read original: arXiv:2402.13764 - Published 9/12/2024 by Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, Xian-ling Mao

Overview

The paper proposes CriticEval, a benchmark for evaluating how well large language models (LLMs) act as critics that assess the quality and reasoning of model responses.
It compares the critique abilities of various LLMs on a diverse set of tasks, providing insights into their strengths and weaknesses as critical evaluators.
The goal is to develop more informative and robust critique generation models to improve model evaluation and development.

Plain English Explanation

CriticEval: Evaluating Large Language Model as Critic is a research paper that explores using large language models (LLMs) as "critics" that assess the quality and reasoning of other AI models' responses. The researchers created a benchmark called CriticEval that tests the critique abilities of different LLMs across a variety of tasks.

The motivation is to develop more informative and reliable critique generation models to better evaluate and improve AI systems. By having LLMs critique the outputs of other models, the researchers can gain insights into the strengths and limitations of the LLMs as critical evaluators. This could lead to advancements in how we assess and refine AI models.

Technical Explanation

Task Input (𝐼) and Response (𝑅)

The researchers define the task input (𝐼) as the context or prompt provided to an AI model, and the response (𝑅) as the output generated by the model.

Critique (𝐶)

The critique (𝐶) is the evaluation or feedback provided by the LLM critic on the quality and reasoning of the model's response (𝑅). The critique aims to identify strengths, weaknesses, and areas for improvement.
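
The three objects defined above (𝐼, 𝑅, 𝐶) can be pictured as a simple record plus a prompt that asks a critic model for 𝐶 given 𝐼 and 𝑅. The following is a minimal sketch, not the paper's data format: the class name CritiqueSample, its field names, the function build_critic_prompt, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CritiqueSample:
    """One (I, R, C) triple. Field names are illustrative assumptions."""
    task_input: str                         # I: the prompt given to the model
    response: str                           # R: the model's output for I
    critique_text: str                      # C: textual feedback from the critic
    critique_score: Optional[float] = None  # optional scalar-valued critique


def build_critic_prompt(task_input: str, response: str) -> str:
    """Assemble a prompt asking a critic LLM to critique R given I.

    The wording is a hypothetical template, not taken from the paper.
    """
    return (
        "You are a critic. Read the task and the response, then describe "
        "the response's strengths, weaknesses, and how to improve it.\n\n"
        f"Task input:\n{task_input}\n\n"
        f"Response:\n{response}\n\n"
        "Critique:"
    )


sample = CritiqueSample(
    task_input="What is 2 + 2?",
    response="5",
    critique_text="The arithmetic is wrong; 2 + 2 equals 4.",
)
print(build_critic_prompt(sample.task_input, sample.response))
```

Keeping the scalar score optional reflects that a critique may be purely textual; whether a given benchmark stores both forms together is an assumption of this sketch.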

CriticEval

CriticEval is a benchmark suite that includes a diverse set of tasks and associated datasets for evaluating the critique abilities of LLMs. According to the paper's abstract, it evaluates critique ability along four dimensions across nine task scenarios. The tasks cover areas like common sense reasoning, factual knowledge, and task-specific reasoning.

Evaluation Metrics

The researchers use several metrics to assess the quality of the critiques generated by the LLMs, including:

Critique Quality: How informative, insightful, and helpful the critique is.
Critique Relevance: How relevant the critique is to the given task and response.
Critique Correctness: How accurate the critique is in identifying strengths, weaknesses, and areas for improvement.
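
As a rough illustration of how per-critique ratings on the three metrics above might be rolled up into one score per metric for a critic model, here is a minimal sketch. It assumes each critique has already been rated on a numeric scale (for example 1 to 5) by a human or a judge model; the rating scale, the dictionary keys, and the function name summarize_ratings are hypothetical and not taken from the paper.

```python
from statistics import mean

# Metric names mirror the list above; the keys themselves are assumptions.
METRICS = ("quality", "relevance", "correctness")


def summarize_ratings(ratings: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over all rated critiques from one critic model."""
    if not ratings:
        raise ValueError("at least one rated critique is required")
    return {metric: mean(r[metric] for r in ratings) for metric in METRICS}


# Two critiques from the same critic, each rated on a hypothetical 1-5 scale.
ratings = [
    {"quality": 4, "relevance": 5, "correctness": 3},
    {"quality": 2, "relevance": 5, "correctness": 4},
]
print(summarize_ratings(ratings))
```

A plain mean is used only for simplicity; a real benchmark may weight tasks or dimensions differently.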

Experiments

The researchers conducted experiments comparing the critique abilities of a range of open-source and closed-source LLMs on the CriticEval tasks. The results provide insights into the models' strengths and limitations as critical evaluators.

Critical Analysis

The paper presents a novel and promising approach to evaluating LLMs by leveraging their critique capabilities. This could lead to more informative and robust model evaluation, ultimately helping to improve the development of AI systems.

However, the researchers acknowledge that the current CriticEval tasks may not fully capture the nuances of real-world critique generation. There is room for further refinement and expansion of the benchmark to better reflect the complexities of critiquing model responses in various domains.

Additionally, the evaluation metrics used, while reasonable, may not provide a complete picture of the critique's usefulness. Incorporating feedback from human experts or end-users could provide additional valuable insights.

Conclusion

The CriticEval benchmark offers a promising way to evaluate the critique abilities of LLMs, which could lead to advancements in model evaluation and development. Insights from the experiments can inform the design of more informative and robust critique generation models, ultimately contributing to the progress of artificial intelligence research and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CriticEval: Evaluating Large Language Model as Critic

Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, Xian-ling Mao

Critique ability, i.e., the capability of Large Language Models (LLMs) to identify and rectify flaws in responses, is crucial for their applications in self-improvement and scalable oversight. While numerous studies have been proposed to evaluate critique ability of LLMs, their comprehensiveness and reliability are still limited. To overcome this problem, we introduce CriticEval, a novel benchmark designed to comprehensively and reliably evaluate critique ability of LLMs. Specifically, to ensure the comprehensiveness, CriticEval evaluates critique ability from four dimensions across nine diverse task scenarios. It evaluates both scalar-valued and textual critiques, targeting responses of varying quality. To ensure the reliability, a large number of critiques are annotated to serve as references, enabling GPT-4 to evaluate textual critiques reliably. Extensive evaluations of open-source and closed-source LLMs first validate the reliability of evaluation in CriticEval. Then, experimental results demonstrate the promising potential of open-source LLMs, the effectiveness of critique datasets and several intriguing relationships between the critique ability and some critical factors, including task types, response qualities and critique dimensions. Datasets and evaluation toolkit for CriticEval will be publicly released.

9/12/2024

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang

Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4's direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.

6/27/2024

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang

The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.

6/4/2024

PRE: A Peer Review Based Large Language Model Evaluator

Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu

The impressive performance of large language models (LLMs) has attracted considerable attention from the academic and industrial communities. Besides how to construct and train LLMs, how to effectively evaluate and compare the capacity of LLMs has also been well recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs on different tasks. However, these paradigms often suffer from high cost, low generalizability, and inherited biases in practice, which make them incapable of supporting the sustainable development of LLMs in long term. In order to address these issues, inspired by the peer review systems widely used in academic publication process, we propose a novel framework that can automatically evaluate LLMs through a peer-review process. Specifically, for the evaluation of a specific task, we first construct a small qualification exam to select reviewers from a couple of powerful LLMs. Then, to actually evaluate the submissions written by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to rate or compare the submissions. The final ranking of evaluatee LLMs is generated based on the results provided by all reviewers. We conducted extensive experiments on text summarization tasks with eleven LLMs including GPT-4. The results demonstrate the existence of biasness when evaluating using a single LLM. Also, our PRE model outperforms all the baselines, illustrating the effectiveness of the peer review mechanism.

6/4/2024
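
The peer-review procedure described in the PRE abstract above (a qualification exam selects reviewer LLMs, the reviewers rate the evaluatees' submissions, and a final ranking is built from all reviewers' results) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the accuracy threshold, the unweighted averaging of reviewer scores, and the names select_reviewers and rank_evaluatees are all hypothetical.

```python
def select_reviewers(exam_accuracy: dict[str, float], threshold: float) -> list[str]:
    """Keep candidate reviewers whose qualification-exam accuracy meets the threshold."""
    return sorted(name for name, acc in exam_accuracy.items() if acc >= threshold)


def rank_evaluatees(
    ratings: dict[str, dict[str, float]], reviewers: list[str]
) -> list[str]:
    """Rank evaluatees by their mean score over the qualified reviewers.

    ratings[reviewer][evaluatee] is the score that reviewer gave that evaluatee.
    Unweighted averaging is an assumption made for this sketch.
    """
    collected: dict[str, list[float]] = {}
    for reviewer in reviewers:
        for evaluatee, score in ratings[reviewer].items():
            collected.setdefault(evaluatee, []).append(score)
    means = {e: sum(s) / len(s) for e, s in collected.items()}
    # Highest mean first; ties are broken alphabetically for determinism.
    return sorted(means, key=lambda e: (-means[e], e))


exam_accuracy = {"reviewer_a": 0.9, "reviewer_b": 0.5, "reviewer_c": 0.8}
ratings = {
    "reviewer_a": {"model_x": 4, "model_y": 2, "model_z": 5},
    "reviewer_b": {"model_x": 1, "model_y": 5, "model_z": 1},
    "reviewer_c": {"model_x": 5, "model_y": 3, "model_z": 5},
}
reviewers = select_reviewers(exam_accuracy, threshold=0.7)
print(rank_evaluatees(ratings, reviewers))
```

In this toy example reviewer_b fails the exam, so its ratings are ignored when the ranking is computed.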