ChartBench: A Benchmark for Complex Visual Reasoning in Charts

arXiv:2312.15915

Published 6/21/2024 by Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, Jian Guo

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation. However, current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics. To address this, we propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. ChartBench includes 42 categories, 66.6k charts, and 600k question-answer pairs. Notably, many charts lack data point annotations, which requires MLLMs to derive values similar to human understanding by leveraging inherent chart elements such as color, legends, and coordinate systems. We also design an enhanced evaluation metric, Acc+, to evaluate MLLMs without extensive manual or costly LLM-based evaluations. Furthermore, we propose two baselines based on the chain of thought and supervised fine-tuning to improve model performance on unannotated charts. Extensive experimental evaluations of 18 open-sourced and 3 proprietary MLLMs reveal their limitations in chart comprehension and offer valuable insights for further research. Code and dataset are publicly available at https://chartbench.github.io.

Overview

• This paper introduces ChartBench, a new benchmark for evaluating the complex visual reasoning capabilities of multimodal large language models (MLLMs) on chart-based tasks.

• The benchmark covers 42 chart categories, 66.6k charts, and 600k question-answer pairs, and is designed to push the limits of current MLLMs.

• The authors evaluate 18 open-source and 3 proprietary MLLMs on ChartBench and provide insights into their strengths and weaknesses in handling complex chart-based reasoning.

Plain English Explanation

The researchers have created a new tool called ChartBench to test how well language models can understand and reason about information presented in charts and graphs. Charts can convey a lot of complex information, so this benchmark is designed to push the boundaries of what current AI models are capable of.

The benchmark includes a wide variety of chart types, like line charts, bar charts, and scatter plots, as well as different types of questions that require understanding the data, identifying trends, and drawing conclusions. This makes it a more comprehensive test than previous benchmarks, which tended to focus on simpler chart-related tasks.

The researchers evaluated several leading language models on this new benchmark and found that while the models performed reasonably well on some tasks, they struggled with the more complex reasoning required on many of the questions. This suggests there is still significant room for improvement in building AI systems that can truly understand and reason about visual information in the same way humans can.

Technical Explanation

The authors introduce ChartBench, a new benchmark designed to evaluate the complex visual reasoning capabilities of MLLMs on chart-based tasks. According to the abstract, it spans 42 chart categories, 66.6k charts, and 600k question-answer pairs. The chart types include line charts, bar charts, scatter plots, and others, and the questions require understanding the data, identifying trends, and drawing conclusions.

A distinguishing feature of the dataset is that many charts carry no data point annotations. A model cannot simply read a printed number off the image; it has to derive values the way a human reader would, from chart elements such as color, legends, and coordinate systems. The questions are designed to test a variety of reasoning skills, going beyond simple retrieval or matching to require a deeper understanding of the chart data and the ability to make inferences from it.
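The abstract notes that values on unannotated charts must be derived from elements such as the coordinate system. A minimal sketch of that idea, assuming a linear axis and two ticks whose pixel positions and data values are already known, is shown below. The function name `pixel_to_value` and the example coordinates are hypothetical illustrations, not part of ChartBench.

```python
from typing import Tuple

# A tick is (pixel_position, data_value) along one axis.
# This representation is an assumption made for illustration.
Tick = Tuple[float, float]


def pixel_to_value(pixel: float, tick_a: Tick, tick_b: Tick) -> float:
    """Estimate the data value at `pixel` by linear interpolation
    between two known axis ticks (assumes a linear axis scale)."""
    (pixel_a, value_a), (pixel_b, value_b) = tick_a, tick_b
    if pixel_a == pixel_b:
        raise ValueError("ticks must sit at different pixel positions")
    # Data units per pixel along this axis.
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return value_a + (pixel - pixel_a) * scale


if __name__ == "__main__":
    # Hypothetical y-axis: the tick labeled 0 is at pixel row 400 and the
    # tick labeled 100 is at pixel row 100 (image rows grow downward).
    zero_tick = (400.0, 0.0)
    hundred_tick = (100.0, 100.0)
    # The top edge of an unlabeled bar sits at pixel row 175.
    print(pixel_to_value(175.0, zero_tick, hundred_tick))
```

A human reader does this estimate implicitly by eye. A logarithmic axis or a polar chart would need a different mapping, which is part of what makes unannotated charts hard.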

The authors evaluate 18 open-source and 3 proprietary MLLMs on ChartBench, scoring them with an enhanced metric, Acc+, that avoids extensive manual review or costly LLM-based judging. They find that while the models perform reasonably well on some tasks, they struggle with the more complex reasoning required by many of the questions, especially on unannotated charts. To narrow the gap, the authors also propose two baselines, one based on chain of thought and one on supervised fine-tuning.
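The abstract names an enhanced metric, Acc+, but does not define it. One plausible reading, and an assumption here rather than a detail confirmed by the abstract, is a paired yes/no scheme: each query is posed once with a true assertion and once with a false one, and the model earns credit only if it judges both correctly. The sketch below contrasts that paired score with plain accuracy; the record layout and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (query_id, expected_answer, predicted_answer); answers are "yes" or "no".
# This record layout is an assumption made for illustration.
Record = Tuple[str, str, str]


def _is_correct(expected: str, predicted: str) -> bool:
    return expected.strip().lower() == predicted.strip().lower()


def plain_accuracy(records: Iterable[Record]) -> float:
    """Fraction of individual judgments that are correct."""
    results = [_is_correct(exp, pred) for _, exp, pred in records]
    return sum(results) / len(results) if results else 0.0


def paired_accuracy(records: Iterable[Record]) -> float:
    """Fraction of queries for which every paired judgment is correct."""
    by_query: Dict[str, List[bool]] = defaultdict(list)
    for query_id, expected, predicted in records:
        by_query[query_id].append(_is_correct(expected, predicted))
    if not by_query:
        return 0.0
    passed = sum(1 for results in by_query.values() if all(results))
    return passed / len(by_query)


if __name__ == "__main__":
    # A model that answers "yes" to everything.
    always_yes = [
        ("q1", "yes", "yes"), ("q1", "no", "yes"),
        ("q2", "yes", "yes"), ("q2", "no", "yes"),
    ]
    print(plain_accuracy(always_yes))   # half of the judgments are right
    print(paired_accuracy(always_yes))  # no query is fully right
```

Under this reading, a model that always answers "yes" gets half of the individual judgments right but passes no pairs, so the paired score filters out lucky guesses using only string comparison, with no human or LLM judge.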

Critical Analysis

The ChartBench benchmark represents an important step forward in evaluating the complex visual reasoning capabilities of MLLMs. By including a diverse range of chart types and question types, and by deliberately including charts without data point annotations, the authors have created a more comprehensive and challenging test than previous benchmarks, which tended to focus on simpler tasks.

That said, the benchmark is not without its limitations. The dataset is limited to English-language charts and questions, and it's unclear how well the findings would generalize to other languages or cultural contexts. Additionally, the benchmark may not capture all the nuances of real-world chart interpretation, which can often involve domain-specific knowledge and contextual understanding.

Further research is needed to understand the specific strengths and weaknesses of different MLLM architectures and training approaches in chart-based reasoning. It would also be valuable to explore ways to improve performance on the more challenging tasks, potentially through novel training techniques or the incorporation of modalities beyond text and images.

Conclusion

ChartBench gives researchers a broader and harder test of chart reasoning than earlier benchmarks. Its evaluation of 18 open-source and 3 proprietary MLLMs reveals significant room for improvement in how current models understand and reason about visual information.

The findings from this research highlight the need for continued advancements in multimodal AI systems, particularly in areas that require complex reasoning and inference. As AI becomes increasingly integrated into our daily lives, it will be crucial for these systems to be able to understand and interpret visual information with the same nuance and depth as humans. The ChartBench benchmark provides a valuable tool for driving progress in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/

6/27/2024

cs.CL cs.CV

Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs

Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Enamul Hoque

Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts. To facilitate chart-based reasoning using natural language, various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts. These tasks pose a unique challenge, demanding both vision-language reasoning and a nuanced understanding of chart data tables, visual encodings, and natural language prompts. Despite the recent success of Large Language Models (LLMs) across diverse NLP tasks, their abilities and limitations in the realm of data visualization remain under-explored, possibly due to their lack of multi-modal capabilities. To bridge the gap, this paper presents the first comprehensive evaluation of the recently developed large vision language models (LVLMs) for chart understanding and reasoning tasks. Our evaluation includes a comprehensive assessment of LVLMs, including GPT-4V and Gemini, across four major chart reasoning tasks. Furthermore, we perform a qualitative evaluation of LVLMs' performance on a diverse range of charts, aiming to provide a thorough analysis of their strengths and weaknesses. Our findings reveal that LVLMs demonstrate impressive abilities in generating fluent texts covering high-level data insights while also encountering common problems like hallucinations, factual errors, and data bias. We highlight the key strengths and limitations of chart comprehension tasks, offering insights for future research.

6/4/2024

cs.CL

mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, BiHui Yu, Ruifeng Guo

In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.

4/3/2024

cs.CV cs.AI

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Chufan Shi, Cheng Yang, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang

We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent the authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, Economics, etc.). These charts span 18 regular types and 4 advanced types, diversifying into 191 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 11 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4V and Claude-3-opus only achieve average scores of 73.2 and 53.7, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.

6/17/2024

cs.SE cs.CL cs.CV