MileBench: Benchmarking MLLMs in Long Context

2404.18532

Published 5/16/2024 by Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Abstract

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 22 models, revealed that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.

Overview

This paper introduces MileBench, a new benchmark for evaluating Multimodal Large Language Models (MLLMs) on long-context, multi-image tasks.
The authors argue that existing benchmarks mostly use single-image, short-text samples, and that those covering multi-image tasks either cap the image count or focus on one specific task.
MileBench combines multimodal long contexts with multiple tasks requiring both comprehension and generation, organized into a diagnostic and a realistic evaluation set.

Plain English Explanation

The research paper presents a new benchmark called MileBench that is designed to test how well multimodal models, which take both images and text as input, perform when the input is long and contains many images. Many existing benchmarks for these models use a single image and a short piece of text, but real-world use often involves long inputs that interleave text with multiple images.

MileBench includes a variety of tasks that assess the capabilities of an MLLM (Multimodal Large Language Model) when working with such long, multi-image inputs. The tasks require both comprehension (understanding what the images and text contain) and generation (producing text that builds on that context).

The goal of this new benchmark is to provide a more comprehensive way to evaluate how well large language models can handle real-world scenarios that involve processing and reasoning over long stretches of information, rather than just short snippets. This could help identify areas where current models struggle and inform future research and development efforts.

Technical Explanation

The paper introduces MileBench, a new benchmark for evaluating Multimodal Large Language Models (MLLMs) on tasks that require comprehension and generation in long, multi-image contexts. The authors argue that existing benchmarks often focus on single-image, short-text samples, and that multi-image benchmarks either limit the image count or target one specific task, which may obscure the difficulties MLLMs face in practice.

MileBench comprises multimodal long contexts and multiple tasks requiring both comprehension and generation. It is organized into two evaluation sets: a diagnostic set, which probes a model's capacity to adapt to long contexts, and a realistic set, which measures its ability to complete tasks in long-context scenarios.

The authors tested 22 models on MileBench. The closed-source GPT-4o outperformed the others, while most open-source MLLMs struggled in long-context situations. The performance gap also tended to widen as the number of images increased, highlighting multi-image long-context handling as an area for improvement.
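The abstract reports that the performance gap tends to widen as the number of images grows. One way to look for such a trend is to group per-sample results by image count and compute accuracy per group. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the `EvalRecord` type, the bucket edges, and the model names are hypothetical, and the toy records are made-up placeholders rather than results from the paper.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One scored sample (hypothetical schema, not MileBench's format)."""
    model: str
    num_images: int
    correct: bool


def bucket_label(num_images, edges):
    """Map an image count to a label such as '1-4' or '10+'.

    `edges` holds the sorted lower bounds of the buckets.
    """
    i = bisect_right(edges, num_images) - 1
    if i < 0:
        raise ValueError(f"num_images={num_images} is below the first edge")
    if i + 1 < len(edges):
        return f"{edges[i]}-{edges[i + 1] - 1}"
    return f"{edges[i]}+"


def accuracy_by_image_bucket(records, edges=(1, 5, 10)):
    """Return {model: {bucket_label: accuracy}} for the given records."""
    # totals[model][bucket] = [number correct, number seen]
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in records:
        cell = totals[r.model][bucket_label(r.num_images, edges)]
        cell[0] += int(r.correct)
        cell[1] += 1
    return {
        model: {bucket: c / n for bucket, (c, n) in buckets.items()}
        for model, buckets in totals.items()
    }


# Toy placeholder records (invented for illustration only).
toy_records = [
    EvalRecord("model_a", 2, True),
    EvalRecord("model_a", 3, True),
    EvalRecord("model_a", 12, True),
    EvalRecord("model_a", 15, False),
    EvalRecord("model_b", 2, True),
    EvalRecord("model_b", 4, False),
    EvalRecord("model_b", 11, False),
    EvalRecord("model_b", 20, False),
]

summary = accuracy_by_image_bucket(toy_records)
```

Comparing two models bucket by bucket in `summary` shows whether the difference between them grows toward the higher image-count buckets. The bucket edges are a free choice and should be matched to the image-count distribution of the data being analyzed.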

Critical Analysis

The MileBench benchmark appears to be a valuable contribution to the field, as it targets an important gap in the current language model evaluation landscape. By focusing on long-context tasks, the authors have identified a critical area where existing models often falter, despite their strong performance on shorter-form benchmarks.

However, the paper does acknowledge some limitations of the benchmark, such as the potential for task design and dataset curation biases to influence the results. Additionally, the authors note that the specific performance gaps observed may be influenced by the current state of MLLM architectures and training approaches, which are rapidly evolving.

Further research is needed to fully understand the underlying reasons for the long-context challenges faced by MLLMs, and to develop more effective strategies for imbuing these models with robust long-range reasoning capabilities. Potential directions for future work could include exploring the role of memory and hierarchical processing, as well as integrating multimodal information to provide richer contextual cues.

Conclusion

The MileBench paper presents a timely contribution to the evaluation of multimodal models. By shifting the focus to long-context, multi-image tasks, the authors have highlighted a gap in the current benchmark landscape and provided a tool for assessing the real-world capabilities of Multimodal Large Language Models.

The benchmark experiments suggest that, while GPT-4o leads the 22 models tested, most open-source MLLMs still struggle in long-context situations, and the gap tends to widen as the number of images increases. This underscores the need for continued research on long-context understanding and generation, especially in scenarios involving multiple images.

As the field of natural language processing continues to advance, benchmarks like MileBench will play a crucial role in guiding the development of more robust and versatile language models that can effectively handle the complexities of real-world language use.

Related Papers

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

6/21/2024

cs.CL

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, Hao Wang

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.

6/18/2024

cs.LG cs.AI cs.CL cs.CV

Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate on 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label space and shorter demonstrations. However, they struggle with a more challenging task like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for the future long-context LLMs.

6/13/2024

cs.CL cs.AI

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens. Meanwhile, building high-quality benchmarks with much longer text lengths and more demanding tasks to provide comprehensive evaluations is of immense practical interest to facilitate long context understanding research of LLMs. However, prior benchmarks create datasets that ostensibly cater to long-text comprehension by expanding the input of traditional tasks, which falls short to exhibit the unique characteristics of long-text understanding, including long dependency tasks and longer text length compatible with modern LLMs' context window size. In this paper, we introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench, which includes three scenarios: Fiction Reading, Paper Reading, and Law Reading, and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation, covering 27 subtasks in English and Chinese. It has an average length of 100K+ words (English) and 200K+ characters (Chinese). Evaluating six leading LLMs on XL$^2$Bench, we find that their performance significantly lags behind human levels. Moreover, the observed decline in performance across both the original and enhanced datasets underscores the efficacy of our approach to mitigating data contamination.

4/9/2024

cs.CL