MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

Read original: arXiv:2406.06565 - Published 6/12/2024 by Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You


Overview

Evaluating large language models (LLMs) is challenging
Traditional ground-truth-based benchmarks fail to capture the nuance of real-world queries
LLM-as-judge benchmarks suffer from grading biases and limited query quantity
Both can become contaminated over time
User-facing evaluation like Chatbot Arena provides reliable signals but is costly and slow

Plain English Explanation

Assessing the performance of large language models (LLMs) is a complex task. The traditional approach of using predefined benchmark tests often falls short because such tests don't reflect the diversity and complexity of the questions people actually ask.

Another method is to have an LLM act as the judge and grade model responses, but the judge's grading can be biased, and such benchmarks typically contain a limited number of queries. Both kinds of benchmark can also become contaminated over time, as test questions leak into model training data and scores stop reflecting true ability.

User-based evaluations like Chatbot Arena provide more realistic feedback, but they are time-consuming and expensive to run.

Technical Explanation

The paper proposes a new evaluation paradigm called MixEval that strategically combines existing benchmarks to create a more efficient, high-quality evaluation system. MixEval bridges comprehensive real-world queries mined from the web with well-graded ground-truth benchmarks by matching the web-mined queries to similar ones in the existing datasets.
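The matching step described above can be sketched as a nearest-neighbor search: for each web-mined query, pick the most similar benchmark question. The sketch below is an illustration under stated assumptions, not the authors' pipeline. The summary does not say which similarity measure is used, so a simple bag-of-words cosine similarity stands in for it, and the function names (`bag_of_words`, `cosine`, `match_queries`) and the sample data are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_queries(web_queries, benchmark_items):
    """For each web query, return (query, best benchmark item, similarity).

    Assumption: each benchmark item is a dict with "question" and "answer"
    keys, so the matched item carries its ground-truth answer with it.
    """
    bench_vecs = [(item, bag_of_words(item["question"])) for item in benchmark_items]
    matches = []
    for query in web_queries:
        query_vec = bag_of_words(query)
        best_item, best_score = max(
            ((item, cosine(query_vec, vec)) for item, vec in bench_vecs),
            key=lambda pair: pair[1],
        )
        matches.append((query, best_item, best_score))
    return matches


# Made-up example data, for illustration only.
web_queries = [
    "what is the capital city of france",
    "how do plants make food from sunlight",
]
benchmark_items = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Which process lets plants make food using sunlight?",
     "answer": "Photosynthesis"},
]
matches = match_queries(web_queries, benchmark_items)
```

The point of the design is that each selected item keeps its ground-truth answer, so grading stays cheap and objective, while the web queries decide which benchmark questions are selected.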

The authors further develop MixEval-Hard, which offers more challenging queries and therefore more room for model improvement. These benchmarks achieve a 0.96 model ranking correlation with Chatbot Arena while being much faster and cheaper to run (6% of the time and cost of MMLU). A rapid and stable data update pipeline also allows the benchmark to be refreshed dynamically, keeping the evaluation current.
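To make the idea of a model ranking correlation concrete, the sketch below computes a rank correlation between benchmark scores and arena ratings for the same set of models. It assumes Spearman's rank correlation as the statistic, which the summary does not name, and all scores in the example are made up for illustration; they are not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four models (same model order in both lists).
benchmark_scores = [80.0, 70.0, 60.0, 50.0]
arena_ratings = [1250.0, 1200.0, 1100.0, 1150.0]
rho = spearman(benchmark_scores, arena_ratings)
```

In this made-up example the two lists agree on the top two models and swap the bottom two, so the correlation is high but below 1. A value near 1 means the benchmark orders models almost exactly as the arena does.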

The paper provides extensive meta-evaluation and analysis of both its own and existing LLM benchmarks to deepen the community's understanding of LLM evaluation and guide future research.

Critical Analysis

The paper presents a thoughtful and innovative approach to LLM evaluation that addresses many of the shortcomings of existing methods. By strategically combining benchmarks, MixEval and MixEval-Hard offer a balance of real-world relevance and efficient, impartial grading.

However, the authors acknowledge that their approach still has limitations. For example, the web-mined queries may not fully capture the breadth of real-world usage, and the ground-truth benchmarks may have their own biases. Additionally, the dynamic update pipeline, while an advantage, could introduce new challenges over time.

Further research may be needed to explore ways to address these limitations and continue improving LLM evaluation. Continued collaboration between researchers and industry practitioners will be crucial to develop robust, scalable, and meaningful benchmarks that keep pace with the rapid advancements in large language models.

Conclusion

This paper introduces a novel approach to LLM evaluation called MixEval that combines the strengths of existing benchmark methods. By strategically mixing real-world queries with well-curated ground-truth datasets, MixEval and MixEval-Hard provide an efficient, high-quality way to assess LLM performance that correlates strongly with user-facing evaluations.

The authors' extensive analysis and meta-evaluation offer valuable insights to deepen the community's understanding of LLM evaluation, guiding future research in this critical area. As large language models continue to advance, developing reliable and scalable evaluation frameworks will be essential to ensure their responsible development and deployment.



Related Papers


MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You

Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks. Based on MixEval, we further build MixEval-Hard, which offers more room for model improvement. Our benchmarks' advantages lie in (1) a 0.96 model ranking correlation with Chatbot Arena arising from the highly impartial query distribution and grading mechanism, (2) fast, cheap, and reproducible execution (6% of the time and cost of MMLU), and (3) dynamic evaluation enabled by the rapid and stable data update pipeline. We provide extensive meta-evaluation and analysis for our and existing LLM benchmarks to deepen the community's understanding of LLM evaluation and guide future research directions.

6/12/2024

Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Ravi Raju, Swayambhoo Jain, Bo Li, Jonathan Li, Urmish Thakker

Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark's usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC (Dubois et al., 2024) and Arena-Hard v0.1 (Li et al., 2024) are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, and agreement (84%) with Chatbot Arena and (0.915) Spearman correlation. The agreement values are 9% better than Arena Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 more than the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.

8/21/2024

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We opensource our codebase and maintain leaderboard of LIVEBENCH at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.

7/18/2024

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.

7/2/2024