VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

2404.05955

Published 4/10/2024 by Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Abstract

Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks, unable to measure fine-grained abilities such as OCR, understanding, and grounding. In this paper, we introduce bench{}, a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks. bench{} consists of seven tasks, and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains. We evaluate 14 open-source MLLMs, Gemini Pro, Claude-3 series, and GPT-4V(ision) on bench{}, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe bench{} will serve as a valuable resource for the research community and contribute to the creation of more powerful and versatile MLLMs for web-related applications.

Create account to get full access

Overview

This paper explores the progress of Multimodal Large Language Models (MLLMs) in understanding and grounding web pages.
It introduces a new benchmark called "VisualWebBench" to evaluate MLLM performance on tasks related to web page understanding and interaction.
The benchmark includes a diverse set of tasks, such as visual question answering, visual reasoning, and web content understanding.

Plain English Explanation

The paper looks at how well large AI language models that can work with both text and images (called Multimodal Large Language Models or MLLMs) have improved at understanding and interacting with web pages. The researchers created a new testing tool called "VisualWebBench" to evaluate these models on different tasks related to web page content, like answering questions about the images and text on a webpage, reasoning about the relationships between elements, and generally understanding the meaning and purpose of web page content. This benchmark provides a way to measure how capable these advanced AI models have become at dealing with the complex and varied information found on webpages.

Technical Explanation

The paper introduces a new benchmark called "VisualWebBench" to assess the progress of Multimodal Large Language Models (MLLMs) in understanding and grounding web page content. The benchmark includes a diverse set of tasks, such as visual question answering, visual reasoning, and web content understanding. The goal is to evaluate how well these advanced language models can process the multimodal information (text, images, layout, etc.) present on web pages.

The authors argue that web page understanding is an important yet underexplored area for MLLM research, as it requires models to handle a wide range of web-specific challenges, such as irregular layouts, diverse content types, and complex interactions between different elements. By introducing VisualWebBench, the paper aims to spur further progress in this domain and provide a standardized way to benchmark MLLM capabilities.

Critical Analysis

The paper makes a strong case for the importance of web page understanding as a benchmark for MLLM development. The tasks included in VisualWebBench cover a wide range of relevant skills, from basic content comprehension to higher-level reasoning and interaction. This comprehensive approach is a strength of the benchmark.

However, the paper does not provide much detail on the specific task formulations, datasets, or evaluation metrics used in VisualWebBench. More information on the benchmark's design and implementation would be helpful for the reader to fully assess its scope and potential limitations. Additionally, the paper does not report any baseline results or comparisons to other MLLM benchmarks, such as METAL or ARE-WE, which would give the reader a better sense of how VisualWebBench fits into the broader landscape of MLLM evaluation.

Conclusion

This paper introduces a new benchmark, VisualWebBench, to assess the progress of Multimodal Large Language Models in understanding and interacting with web page content. The benchmark covers a diverse set of tasks related to web page comprehension, providing a standardized way to measure the capabilities of these advanced AI models. While more details on the benchmark's design and implementation would be helpful, the paper makes a compelling case for the importance of web page understanding as a key challenge for MLLM research and development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🏅

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: url{https://mllm-judge.github.io/}.

6/12/2024

cs.CL cs.AI cs.CV

MileBench: Benchmarking MLLMs in Long Context

Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on specific task (e.g time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 22 models, revealed that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.

5/16/2024

cs.CL cs.AI cs.CV cs.LG

🏋️

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic textit{visually grounded tasks}. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data is publicly available at https://jykoh.com/vwa.

6/7/2024

cs.LG cs.CL cs.CV

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench.

4/30/2024

cs.CV cs.CL