CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

2406.18521

Published 6/27/2024 by Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi and 3 others

cs.CL cs.CV

Abstract

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/

Overview

The paper highlights the importance of chart understanding in applying Multimodal Large Language Models (MLLMs) to real-world tasks.
It argues that existing datasets often focus on oversimplified and homogeneous charts, leading to an over-optimistic measure of progress.
The paper proposes CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers.
CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements, and 2) reasoning questions that require synthesizing information across complex visual elements.

Plain English Explanation

Charts and visualizations play a crucial role in understanding complex information, such as scientific papers or financial reports. However, the researchers argue that existing datasets used to evaluate how well language models can understand and reason about charts are often too simple and uniform. This can give an overly positive impression of the models' capabilities.

To address this issue, the researchers developed CharXiv, a new dataset of over 2,300 real-world charts from academic papers. These charts are more diverse and challenging than the ones typically used in existing benchmarks. The dataset includes two types of questions: those that test the model's ability to describe basic chart elements, and those that require the model to synthesize information across the chart to answer more complex, reasoning-based questions.

By using this more realistic and diverse dataset, the researchers found that the performance of even the strongest language models, including some of the most advanced proprietary models, is significantly lower than what previous benchmarks had suggested. This highlights the limitations in the chart understanding capabilities of current Multimodal Large Language Models (MLLMs) and the need for further research and development in this area.

Technical Explanation

The paper begins by emphasizing the importance of chart understanding when applying Multimodal Large Language Models (MLLMs) to real-world tasks, such as analyzing scientific papers or financial reports. However, the authors argue that existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress.

To address this issue, the researchers propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart.
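To make the two question types concrete, here is a minimal sketch of how a benchmark that separates descriptive from reasoning questions might be scored. It is not CharXiv's actual data format or grading procedure, which this summary does not describe: the `ChartQuestion` class, its field names, the exact-match rule, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartQuestion:
    chart_id: str   # hypothetical identifier for a chart image
    kind: str       # "descriptive" or "reasoning"
    reference: str  # hypothetical gold answer


def accuracy_by_kind(questions, predictions):
    """Return {kind: accuracy in percent} using case-insensitive exact match.

    Exact match is an assumption of this sketch, not the paper's grading rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.kind] += 1
        if predicted.strip().lower() == question.reference.strip().lower():
            correct[question.kind] += 1
    return {kind: 100.0 * correct[kind] / total[kind] for kind in total}


# Toy data, invented purely for illustration.
questions = [
    ChartQuestion("chart-1", "descriptive", "Epoch"),
    ChartQuestion("chart-1", "reasoning", "Model B"),
    ChartQuestion("chart-2", "descriptive", "4"),
    ChartQuestion("chart-2", "reasoning", "decreasing"),
]
predictions = ["epoch", "Model A", "4", "decreasing"]

scores = accuracy_by_kind(questions, predictions)
print(scores)
```

Reporting the two kinds separately is what lets a benchmark show that a model reads basic chart elements well while still struggling to reason over them.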

The authors demonstrate that although open-source models can appear to outperform strong proprietary models on these oversimplified benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. Their results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. Importantly, all models lag far behind human performance of 80.5%, underscoring the weaknesses in the chart understanding capabilities of existing MLLMs.
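The size of these gaps is easier to see as simple arithmetic. The short sketch below uses only the three reasoning accuracies quoted above; the dictionary layout and the `gap` helper are illustrative names, not anything from the paper.

```python
# Reasoning accuracies (percent) as quoted in the abstract.
reasoning_accuracy = {
    "human": 80.5,
    "GPT-4o": 47.1,
    "InternVL Chat V1.5": 29.2,
}


def gap(higher, lower):
    """Difference in percentage points, rounded to one decimal place."""
    return round(reasoning_accuracy[higher] - reasoning_accuracy[lower], 1)


proprietary_vs_open = gap("GPT-4o", "InternVL Chat V1.5")
human_vs_proprietary = gap("human", "GPT-4o")
human_vs_open = gap("human", "InternVL Chat V1.5")

print(proprietary_vs_open, human_vs_proprietary, human_vs_open)
```

In percentage points, the distance from the strongest proprietary model to human performance is larger than the distance between the strongest proprietary and open-source models.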

Critical Analysis

The researchers acknowledge that while existing datasets, such as mChartQA and ChartMimic, have advanced the field of chart understanding, they often focus on oversimplified and homogeneous charts, leading to an overestimation of progress.

The introduction of CharXiv represents a significant step forward in creating a more realistic and challenging benchmark for evaluating the chart understanding capabilities of MLLMs. By including a diverse set of natural charts and a range of question types, the researchers aim to provide a more faithful measure of progress in this domain.

However, the paper does not address potential biases or limitations in the CharXiv dataset itself. For example, the selection of charts and questions may still be influenced by the researchers' own perspectives and experiences. Expanding the dataset curation to involve a broader range of domain experts could help address this concern.

Additionally, the paper focuses on the performance of current models, but does not provide detailed insights into the specific weaknesses or failure modes of these models. Further analysis and interpretation of the model behaviors could help guide future research and development efforts in this area.

Conclusion

The CharXiv dataset proposed in this paper represents a significant contribution to the field of chart understanding, providing a more realistic and challenging benchmark for evaluating the capabilities of Multimodal Large Language Models (MLLMs). The researchers' findings reveal a substantial gap between the reasoning skills of the strongest proprietary and open-source models, as well as a large disparity between model performance and human-level understanding.

These insights underscore the need for continued research and development in the area of chart understanding, which is crucial for the effective application of MLLMs to real-world tasks, such as analyzing scientific papers or financial reports. By fostering a more rigorous and faithful evaluation of model capabilities, the CharXiv dataset has the potential to drive progress and advancements in this important field of study.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ChartBench: A Benchmark for Complex Visual Reasoning in Charts

Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, Jian Guo

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation. However, current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics. To address this, we propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. ChartBench includes 42 categories, 66.6k charts, and 600k question-answer pairs. Notably, many charts lack data point annotations, which requires MLLMs to derive values similar to human understanding by leveraging inherent chart elements such as color, legends, and coordinate systems. We also design an enhanced evaluation metric, Acc+, to evaluate MLLMs without extensive manual or costly LLM-based evaluations. Furthermore, we propose two baselines based on the chain of thought and supervised fine-tuning to improve model performance on unannotated charts. Extensive experimental evaluations of 18 open-sourced and 3 proprietary MLLMs reveal their limitations in chart comprehension and offer valuable insights for further research. Code and dataset are publicly available at https://chartbench.github.io.

6/21/2024

cs.CV

mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, BiHui Yu, Ruifeng Guo

In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.

4/3/2024

cs.CV cs.AI

Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs

Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Enamul Hoque

Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts. To facilitate chart-based reasoning using natural language, various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts. These tasks pose a unique challenge, demanding both vision-language reasoning and a nuanced understanding of chart data tables, visual encodings, and natural language prompts. Despite the recent success of Large Language Models (LLMs) across diverse NLP tasks, their abilities and limitations in the realm of data visualization remain under-explored, possibly due to their lack of multi-modal capabilities. To bridge the gap, this paper presents the first comprehensive evaluation of the recently developed large vision language models (LVLMs) for chart understanding and reasoning tasks. Our evaluation includes a comprehensive assessment of LVLMs, including GPT-4V and Gemini, across four major chart reasoning tasks. Furthermore, we perform a qualitative evaluation of LVLMs' performance on a diverse range of charts, aiming to provide a thorough analysis of their strengths and weaknesses. Our findings reveal that LVLMs demonstrate impressive abilities in generating fluent texts covering high-level data insights while also encountering common problems like hallucinations, factual errors, and data bias. We highlight the key strengths and limitations of chart comprehension tasks, offering insights for future research.

6/4/2024

cs.CL

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Chufan Shi, Cheng Yang, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang

We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent the authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, Economics, etc.). These charts span 18 regular types and 4 advanced types, diversifying into 191 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 11 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4V and Claude-3-opus only achieve average scores of 73.2 and 53.7, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.

6/17/2024

cs.SE cs.CL cs.CV