MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

2402.04788

Published 6/12/2024 by Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

cs.CL cs.AI cs.CV


Abstract

Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/.


Overview

This paper introduces a novel benchmark called "MLLM-as-a-Judge" to evaluate the ability of multimodal large language models (MLLMs) in assisting judges across diverse modalities.
The benchmark includes three tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking, designed to assess the judgment capacities of MLLMs.
The study reveals that while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, they exhibit significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
The paper highlights persistent challenges in the judgment capacities of large language models, including biases, hallucinatory responses, and inconsistencies, even in advanced models like GPT-4V.

Plain English Explanation

Multimodal large language models (MLLMs) are a type of artificial intelligence system that can process and generate content in multiple formats, such as text, images, and audio. These models have shown great potential in the field of artificial general intelligence, which aims to develop AI systems that can perform a wide range of tasks like humans can.

However, assessing the utility of MLLMs is challenging because there is a lack of standardized benchmarks that align with human preferences. To address this, the researchers in this paper have introduced a new benchmark called "MLLM-as-a-Judge." This benchmark is designed to test the ability of MLLMs to assist judges in various tasks, such as scoring, comparing, and ranking things across different modalities.

The study found that MLLMs show remarkably human-like discernment when comparing pairs of items, but diverge significantly from human preferences when scoring individual items or ranking a batch of items. The researchers also identified persistent problems in the judgment capabilities of large language models, including various biases, inconsistencies, and hallucinatory responses. These problems appear even in advanced models like GPT-4V.

These findings highlight the need for continued research and development to improve the reliability and trustworthiness of MLLMs before they can be fully relied upon as evaluators or decision-makers. The researchers advocate for further efforts to support the ongoing enhancement of MLLM functioning as judges.

Technical Explanation

The paper introduces a novel benchmark called "MLLM-as-a-Judge" to assess the ability of multimodal large language models (MLLMs) in assisting judges across diverse modalities. The benchmark includes three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking.

In the Scoring Evaluation task, the MLLM assigns a numerical score to an individual item from the vision-language benchmark. The Pair Comparison task requires the MLLM to decide which of two given items is better. The Batch Ranking task asks the MLLM to order a set of items by quality or preference.
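The three tasks differ mainly in the shape of the answer the judge must return. The sketch below illustrates that difference by parsing one reply per task. It is not the paper's code: the reply formats ("Score:", "Verdict:", "Ranking:"), the 1 to 5 scale, the "tie" option, and all function names are assumptions made for this example.

```python
import re
from typing import List, Sequence


def parse_score(reply: str, lo: int = 1, hi: int = 5) -> int:
    """Scoring Evaluation: read one integer score (assumed scale lo..hi)."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if not match:
        raise ValueError("no 'Score: <n>' found in reply")
    value = int(match.group(1))
    if not lo <= value <= hi:
        raise ValueError(f"score {value} outside {lo}..{hi}")
    return value


def parse_pair(reply: str) -> str:
    """Pair Comparison: read a verdict of 'A', 'B', or 'tie' (assumed labels)."""
    match = re.search(r"Verdict:\s*(A|B|tie)\b", reply)
    if not match:
        raise ValueError("no 'Verdict: A|B|tie' found in reply")
    return match.group(1)


def parse_ranking(reply: str, labels: Sequence[str]) -> List[str]:
    """Batch Ranking: read an ordering such as 'B > A > C', best first."""
    match = re.search(r"Ranking:\s*(.+)", reply)
    if not match:
        raise ValueError("no 'Ranking: ...' found in reply")
    order = [part.strip() for part in match.group(1).split(">")]
    # The ranking must mention every candidate exactly once.
    if sorted(order) != sorted(labels):
        raise ValueError("ranking is not a permutation of the labels")
    return order


if __name__ == "__main__":
    print(parse_score("The answer matches the image. Score: 4"))
    print(parse_pair("Response B is more detailed. Verdict: B"))
    print(parse_ranking("Ranking: B > A > C", ["A", "B", "C"]))
```

Rejecting malformed replies instead of guessing a value keeps unparseable judgments visible, so they can be counted separately from genuine disagreements with human annotators.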

The study reveals that while MLLMs demonstrate remarkable human-like discernment in the Pair Comparison task, there is a significant divergence from human preferences in the Scoring Evaluation and Batch Ranking tasks. The researchers also uncover persistent challenges in the judgment capacities of large language models, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V.
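To make "divergence from human preferences" concrete, each task needs a way to compare model judgments with human ones. This summary does not state which metrics the authors used, so the choices below are assumptions for illustration: Pearson correlation for scores, plain accuracy for pairwise verdicts, and a length-normalized edit distance for rankings. The function names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Scoring Evaluation: linear correlation between model and human scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sd_x * sd_y)


def pair_accuracy(model: Sequence[str], human: Sequence[str]) -> float:
    """Pair Comparison: fraction of verdicts that match the human verdict."""
    return sum(m == h for m, h in zip(model, human)) / len(human)


def normalized_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Batch Ranking: Levenshtein distance between two orderings,
    divided by the longer length (0.0 means identical rankings)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (item_a != item_b)))  # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b), 1)


if __name__ == "__main__":
    print(pearson([1, 2, 3, 5], [2, 2, 4, 5]))
    print(pair_accuracy(["A", "B", "A", "tie"], ["A", "B", "B", "tie"]))
    print(normalized_edit_distance(["B", "A", "C"], ["A", "B", "C"]))
```

For scores and pairwise accuracy, higher values mean closer agreement with humans, while for the ranking distance lower is better, so the three numbers should not be averaged together directly.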

The findings emphasize that enhancements and further research are needed before MLLMs can be regarded as fully reliable evaluators. The authors call for additional work to support the continued development of MLLMs functioning as judges.

Critical Analysis

The paper presents a valuable contribution to the field of multimodal large language models (MLLMs) by introducing a novel benchmark for assessing their abilities as judges. The MLLM-as-a-Judge benchmark provides a systematic way to evaluate the judgment capabilities of these models across different modalities and tasks.

While the study reveals the impressive human-like discernment of MLLMs in the Pair Comparison task, the significant divergence from human preferences in the Scoring Evaluation and Batch Ranking tasks raises important concerns. These findings highlight the need for further research to understand the root causes of these biases and inconsistencies in the judgment capacities of large language models.

The paper also acknowledges the limitations of the current study, suggesting that the benchmark may not fully capture the nuances of human judgment and preferences. Additionally, the researchers note that the dataset used for the benchmark may not be representative of the diverse range of modalities and tasks that judges may encounter in real-world scenarios.

To address these limitations, the authors advocate for continued efforts to enhance the judgment capabilities of MLLMs, including the exploration of novel architectures, training approaches, and the incorporation of additional modalities and tasks. They also emphasize the importance of further research to develop more comprehensive and reliable multimodal benchmarks that better align with human preferences and decision-making processes.

Conclusion

The paper introduces a novel benchmark, MLLM-as-a-Judge, to assess the ability of multimodal large language models in assisting judges across diverse modalities. While the study reveals impressive human-like discernment in certain tasks, it also highlights persistent challenges in the judgment capacities of these models, including biases, hallucinatory responses, and inconsistencies.

These findings underscore the need for continued research and development to enhance the reliability and trustworthiness of MLLMs before they can be fully relied upon as evaluators or decision-makers. The authors advocate for additional efforts dedicated to supporting the continuous improvement of MLLM functioning as judges, with the ultimate goal of leveraging these powerful models to assist and empower human decision-making processes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.

6/19/2024

cs.CL cs.AI

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Wentao Ge, Shunian Chen, Guiming Hardy Chen, Zhihong Chen, Junying Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xinyi Zhang, Yichen Chai, Xiaoyu Liu, Dingjie Song, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang

Multimodal large language models (MLLMs) (e.g., GPT-4V, LLaVA, and Claude-3) have broadened the scope of AI applications. Yet, evaluating their performance presents a significant challenge owing to the inherently subjective nature of tasks that do not yield clear-cut solutions especially for those open-ended queries. Existing automatic evaluation methodologies are mainly limited in evaluating objective queries without considering real-world user experiences, inadequately addressing the nuances of creative and associative multimodal tasks. In our paper, we propose a new evaluation paradigm for MLLMs, which is evaluating MLLMs with per-sample criteria using potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, with the evaluation samples across six critical levels following the revised Bloom's Taxonomy with the ethical consideration. We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria, and that MLLM-Bench will serve as a catalyst for encouraging the development of user-centric MLLMs tailored to real-world applications. Our benchmark data, online leaderboard and submission entry are at https://mllm-bench.llmzoo.com.

4/30/2024

cs.CL

Humans or LLMs as the Judge? A Study on Judgement Biases

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang

Adopting human and large language models (LLM) as judges (a.k.a human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLMs, questioning the reliability of the evaluation results. In this paper, we propose a novel framework that is free from referencing groundtruth annotations for investigating Misinformation Oversight Bias, Gender Bias, Authority Bias and Beauty Bias on LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit these biases to conduct attacks on LLM judges. We hope that our work can notify the community of the bias and vulnerability of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.

7/1/2024

cs.CL

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks, unable to measure fine-grained abilities such as OCR, understanding, and grounding. In this paper, we introduce VisualWebBench, a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks. VisualWebBench consists of seven tasks, and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains. We evaluate 14 open-source MLLMs, Gemini Pro, Claude-3 series, and GPT-4V(ision) on VisualWebBench, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe VisualWebBench will serve as a valuable resource for the research community and contribute to the creation of more powerful and versatile MLLMs for web-related applications.

4/10/2024

cs.CL cs.AI