F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Read original: arXiv:2401.14869 - Published 8/21/2024 by Yu Sun, Keyu Chen, Shujie Wang, Peiji Li, Qipeng Guo, Hang Yan, Xipeng Qiu, Xuanjing Huang, Dahua Lin

Overview

Large language models (LLMs) have gained significant attention for their impressive performance.
Researchers are increasingly evaluating LLMs, but existing benchmarks are limited to instruction-following capabilities and overlook the fundamental abilities that emerge during pre-training.
Previous subjective evaluation methods rely mainly on scoring by API models, which have limited ability to discern subtle differences when no reference is available.

Plain English Explanation

The paper proposes a new bilingual evaluation benchmark called F-Eval to assess the fundamental abilities of large language models: expression, common sense, and logic. This matters because existing benchmarks only look at how well these models follow instructions and overlook the abilities that emerge during pre-training.

The F-Eval benchmark includes different types of tasks: multiple-choice questions, open-ended questions, and subjective tasks that are judged either against reference responses or without any reference. For the reference-free subjective tasks, the researchers developed new evaluation methods as alternatives to scoring by other AI models, which can have trouble detecting subtle differences.

The researchers tested 13 advanced language models using F-Eval and found that their evaluation methods showed higher correlation coefficients and a larger distinction between models than other evaluators. They also discussed how model size, evaluation dimension, and normalization method influence the results.

Overall, the goal of F-Eval is to provide a more comprehensive way to study the fundamental abilities of large language models, beyond just their instruction-following capabilities.

Technical Explanation

The paper proposes a new bilingual benchmark called F-Eval to evaluate the fundamental abilities of large language models, including expression, common sense, and logic. F-Eval includes four types of tasks:

Multi-choice objective tasks: Multiple-choice questions that assess different facets of language understanding.
Open-ended objective tasks: Free-form generation tasks that evaluate language production.
Reference-based subjective tasks: Tasks where model outputs are compared to reference responses.
Reference-free subjective tasks: Subjective tasks that don't rely on reference comparisons.
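The four categories above differ along two axes: objective versus subjective, and (for subjective tasks) reference-based versus reference-free. The sketch below only restates that taxonomy in code. The names `TaskType`, `is_subjective`, and `uses_new_evaluation_method` are hypothetical and do not come from the paper or its released code.

```python
from enum import Enum


class TaskType(Enum):
    """The four F-Eval task categories named in the summary (labels are illustrative)."""

    MULTI_CHOICE_OBJECTIVE = "multi-choice objective"
    OPEN_ENDED_OBJECTIVE = "open-ended objective"
    REFERENCE_BASED_SUBJECTIVE = "reference-based subjective"
    REFERENCE_FREE_SUBJECTIVE = "reference-free subjective"

    @property
    def is_subjective(self) -> bool:
        # Two of the four categories are subjective; the other two are objective.
        return self in (
            TaskType.REFERENCE_BASED_SUBJECTIVE,
            TaskType.REFERENCE_FREE_SUBJECTIVE,
        )

    @property
    def uses_new_evaluation_method(self) -> bool:
        # Per the summary, the newly devised evaluation methods target
        # reference-free subjective tasks only.
        return self is TaskType.REFERENCE_FREE_SUBJECTIVE


for task_type in TaskType:
    print(task_type.value, task_type.is_subjective, task_type.uses_new_evaluation_method)
```

A lookup like this makes explicit that only one of the four categories depends on the paper's new evaluation methods; how each category is actually scored is not specified in this summary.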

For the reference-free subjective tasks, the researchers developed new evaluation methods as alternatives to scoring by API models, which have limitations in detecting subtle differences.

The researchers evaluated 13 advanced LLMs using F-Eval and found that their evaluation methods showed higher correlation coefficients and a larger distinction between models than other evaluators. They also analyzed how model size, evaluation dimension, and normalization method affect the results.
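To make "correlation coefficient" concrete, here is a minimal sketch of how an evaluator's scores could be compared with a set of trusted scores using Spearman rank correlation. This is an assumption for illustration: the summary does not say which coefficient the paper reports or what the scores are correlated against, and the score lists below are made up.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five model responses (not data from the paper).
trusted_scores = [4.5, 3.0, 2.0, 5.0, 1.0]
evaluator_scores = [0.8, 0.3, 0.7, 0.9, 0.1]
print(round(spearman(trusted_scores, evaluator_scores), 3))  # 0.9
```

A value near 1 means the evaluator orders the responses almost the same way as the trusted scores; a rank-based coefficient is used here because it ignores differences in score scale between evaluators.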

Critical Analysis

The researchers acknowledge that F-Eval is not a comprehensive solution and has some limitations. For example, the benchmark may not capture all the nuances of language understanding and generation. Additionally, the new evaluation methods for reference-free subjective tasks, while intended to address the limitations of API-based scoring, may introduce their own biases or inconsistencies.

Further research could explore ways to expand the scope of F-Eval, incorporate more diverse data sources, and refine the evaluation methods to make them more robust and reliable. It would also be valuable to see how F-Eval results correlate with real-world performance in various applications.

Conclusion

The F-Eval benchmark proposed in this paper is a step towards a more comprehensive evaluation of large language models' fundamental abilities, beyond just their instruction-following capabilities. By including a range of objective and subjective tasks, as well as new evaluation methods for reference-free subjective tasks, F-Eval aims to provide a more robust and informative assessment of LLMs' expression, common sense, and logic.

The findings from testing 13 advanced LLMs suggest that F-Eval can better distinguish between the models' capabilities and offer more insights into their strengths and limitations. As the development of LLMs continues, tools like F-Eval will be crucial for guiding research and ensuring these models are being evaluated and used in the most meaningful and responsible ways.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Yu Sun, Keyu Chen, Shujie Wang, Peiji Li, Qipeng Guo, Hang Yan, Xipeng Qiu, Xuanjing Huang, Dahua Lin

Large language models (LLMs) garner significant attention for their unprecedented performance, leading to an increasing number of studies evaluating LLMs. However, these evaluation benchmarks are limited to assessing the instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly rely on scoring by API models. However, in the absence of references, large models have shown limited ability to discern subtle differences. To bridge the gap, we propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic. The tasks in F-Eval include multi-choice objective tasks, open-ended objective tasks, reference-based subjective tasks and reference-free subjective tasks. For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models. We conduct evaluations on 13 advanced LLMs. Results show that our evaluation methods show higher correlation coefficients and larger distinction than other evaluators. Additionally, we discuss the influence of different model sizes, dimensions, and normalization methods. We anticipate that F-Eval will facilitate the study of LLMs' fundamental abilities.

8/21/2024

FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He

In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.

4/30/2024

Fusion-Eval: Integrating Assistant Evaluators with LLMs

Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng

Evaluating natural language systems poses significant challenges, particularly in the realms of natural language understanding and high-level reasoning. In this paper, we introduce 'Fusion-Eval', an innovative approach that leverages Large Language Models (LLMs) to integrate insights from various assistant evaluators. The LLM is given the example to evaluate along with scores from the assistant evaluators. Each of these evaluators specializes in assessing distinct aspects of responses. Fusion-Eval achieves a 0.962 system-level Kendall-Tau correlation with humans on SummEval and a 0.744 turn-level Spearman correlation on TopicalChat, which is significantly higher than baseline methods. These results highlight Fusion-Eval's significant potential in the realm of natural language system evaluation.

6/10/2024

What is the best model? Application-driven Evaluation for Large Language Models

Shiguo Lian, Kaikai Zhao, Xinhui Liu, Xuejiao Lei, Bikun Yang, Wenjing Zhang, Kai Wang, Zhaoxiang Liu

General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry as they generalize foundation models to various practical tasks in a prompt manner. To assist users in selecting the best model in practical application scenarios, i.e., choosing the model that meets the application requirements while minimizing cost, we introduce A-Eval, an application-driven LLMs evaluation benchmark for general large language models. First, we categorize evaluation tasks into five main categories and 27 sub-categories from a practical application perspective. Next, we construct a dataset comprising 678 question-and-answer pairs through a process of collecting, annotating, and reviewing. Then, we design an objective and effective evaluation method and evaluate a series of LLMs of different scales on A-Eval. Finally, we reveal interesting laws regarding model scale and task difficulty level and propose a feasible method for selecting the best model. Through A-Eval, we provide clear empirical and engineer guidance for selecting the best model, reducing barriers to selecting and using LLMs and promoting their application and development. Our benchmark is publicly available at https://github.com/UnicomAI/DataSet/tree/main/TestData/GeneralAbility.

6/18/2024