On Pre-training of Multimodal Language Models Customized for Chart Understanding

Read original: arXiv:2407.14506 - Published 8/2/2024 by Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Lu Yuan, Leonid Sigal

Overview

Recent studies have explored customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks, such as scientific chart comprehension.
These studies often use visual instruction tuning and specialized datasets to improve question-answering accuracy within the chart domain.
However, they may overlook the fundamental differences between natural image-caption pre-training data and digital chart image-QA data, particularly in the models' ability to extract underlying numeric values from charts.
This paper addresses this oversight by investigating training processes to enhance MLLMs' comprehension of charts.

Plain English Explanation

Multimodal Large Language Models (MLLMs) are powerful AI systems that can understand and generate human language, as well as process and interpret visual information. Recent research has focused on customizing these models for specific tasks, such as understanding and answering questions about scientific charts and graphs.

The researchers behind this paper noticed that while these customized MLLMs often perform well on chart-related question-answering tasks, they may struggle to fully comprehend the underlying numeric data represented in the charts. This is because the pre-training data used to develop these models, which typically consists of natural images with captions, is quite different from the digital charts and associated question-answering data used in the specialized tasks.

To address this issue, the researchers explored several training techniques to improve the models' ability to extract and reason about the numeric values within charts. Their key findings include:

Incorporating raw data values: Pre-training the models to align the chart images with their underlying numeric data significantly improves their comprehension of the chart content.
Replacing images with text: Randomly replacing the chart images with their textual representations during the fine-tuning process helps the models transfer their language reasoning capabilities to the chart interpretation task.
Extracting data before answering: Requiring the models to first extract the chart's numeric data and then use that information to answer questions further boosts the accuracy of their responses.
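
The first finding, aligning chart images with their raw data, can be pictured as building training pairs in which the target text is the chart's underlying table. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the CSV serialization, the prompt wording, and the names `AlignmentSample`, `serialize_table`, and `make_alignment_sample` are all hypothetical and do not come from the paper.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class AlignmentSample:
    image_path: str  # path to the rendered chart image (hypothetical field)
    prompt: str      # instruction shown to the model
    target: str      # text the model is trained to produce


def serialize_table(header, rows):
    """Serialize the chart's underlying data as CSV text (an assumed format)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue().strip()


def make_alignment_sample(image_path, header, rows):
    """Pair a chart image with its raw data values as the training target."""
    return AlignmentSample(
        image_path=image_path,
        prompt="Extract the data table underlying this chart.",
        target=serialize_table(header, rows),
    )


sample = make_alignment_sample(
    "bar_chart_001.png",
    ["year", "sales"],
    [[2021, 12.5], [2022, 15.0]],
)
print(sample.target)
```

Because the target is the data table rather than a caption, a model trained on such pairs has to read values off the chart instead of only describing its appearance.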

Based on these insights, the researchers developed a specialized MLLM called CHOPINLLM, which is designed for in-depth chart comprehension. CHOPINLLM exhibits strong performance in understanding both annotated and unannotated charts across a wide range of types.

Technical Explanation

The researchers explored three key techniques to enhance MLLMs' chart comprehension capabilities:

Incorporating raw data values: To better align the models with the numeric data represented in charts, the researchers incorporated the raw data values into the pre-training process. This "data alignment pre-training" helped the models learn to associate the visual chart elements with their underlying numeric information.
Replacing images with text: During end-to-end fine-tuning, the researchers randomly replaced the chart images with their textual representations. This "text-based fine-tuning" helped the models transfer their language reasoning abilities to the chart interpretation task, since on those examples the questions have to be answered from the text alone.
Extracting data before answering: The researchers also required the models, during fine-tuning, to first extract the numeric data from the chart and then use that information to answer the question. This "data extraction-based fine-tuning" further improved the accuracy of the models' answers.
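
The two fine-tuning techniques above can be sketched together as a function that builds one training example. This is an illustrative sketch only: the replacement probability `text_prob`, the prompt and target templates, and the function name `build_finetune_example` are assumptions made here, not details taken from the paper.

```python
import random


def build_finetune_example(image_path, table_text, question, answer,
                           rng, text_prob=0.5):
    """Build one fine-tuning example (hypothetical format).

    With probability `text_prob` (an assumed hyperparameter) the chart image
    is dropped and its textual representation is placed in the prompt.
    The target always states the extracted data first, then the answer.
    """
    use_text = rng.random() < text_prob
    if use_text:
        # Image replaced by text: the model must reason over the table directly.
        image = None
        prompt = f"Chart data:\n{table_text}\n\nQuestion: {question}"
    else:
        # Image kept: the model must read the values off the chart.
        image = image_path
        prompt = f"Question: {question}"
    # Extract-then-answer target: data first, answer second.
    target = f"Extracted data:\n{table_text}\n\nAnswer: {answer}"
    return {"image": image, "prompt": prompt, "target": target}


rng = random.Random(0)  # seeded so the sampling is reproducible
example = build_finetune_example(
    "bar_chart_001.png",
    "year,sales\n2021,12.5\n2022,15.0",
    "Which year had higher sales?",
    "2022",
    rng,
)
print(example["target"])
```

Putting the extracted table before the answer in the target means the answer is generated conditioned on explicit numbers, which is one plausible reading of why the extract-then-answer format helps.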

By incorporating these training techniques, the researchers developed CHOPINLLM, a specialized MLLM tailored for in-depth chart comprehension. CHOPINLLM demonstrated strong performance in understanding various types of charts, including unannotated ones, while maintaining robust reasoning abilities.

Critical Analysis

The researchers acknowledge that their study is focused on improving the numeric data comprehension capabilities of MLLMs, and they do not address other aspects of chart understanding, such as the interpretation of visual elements or the ability to understand the context and purpose of the charts.

Additionally, the experiments necessarily cover a finite set of chart types and datasets. While the paper establishes a new benchmark for evaluating MLLMs' chart comprehension, further research is needed to assess how well the findings generalize to other chart types and real-world scenarios.

It would also be valuable to explore the performance of CHOPINLLM on tasks beyond question-answering, such as chart generation, summarization, or anomaly detection, to fully assess the capabilities of their specialized MLLM.

Conclusion

This paper presents a novel approach to enhancing the chart comprehension capabilities of Multimodal Large Language Models (MLLMs). By incorporating techniques like data alignment pre-training, text-based fine-tuning, and data extraction-based fine-tuning, the researchers were able to develop CHOPINLLM, a specialized MLLM that demonstrates strong performance in understanding various types of charts, including unannotated ones.

The findings of this study have important implications for the development of AI systems that can effectively interpret and reason about complex visual data, such as scientific charts and graphs. By bridging the gap between natural image-caption data and the unique characteristics of digital chart data, the researchers have made a significant contribution to the field of multimodal language understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On Pre-training of Multimodal Language Models Customized for Chart Understanding

Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Lu Yuan, Leonid Sigal

Recent studies customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks have yielded promising results, especially in the field of scientific chart comprehension. These studies generally utilize visual instruction tuning with specialized datasets to enhance question and answer (QA) accuracy within the chart domain. However, they often neglect the fundamental discrepancy between natural image-caption pre-training data and digital chart image-QA data, particularly in the models' capacity to extract underlying numeric values from charts. This paper tackles this oversight by exploring the training processes necessary to improve MLLMs' comprehension of charts. We present three key findings: (1) Incorporating raw data values in alignment pre-training markedly improves comprehension of chart data. (2) Replacing images with their textual representation randomly during end-to-end fine-tuning transfers the language reasoning capability to chart interpretation skills. (3) Requiring the model to first extract the underlying chart data and then answer the question during fine-tuning can further improve the accuracy. Consequently, we introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension. CHOPINLLM effectively interprets various types of charts, including unannotated ones, while maintaining robust reasoning abilities. Furthermore, we establish a new benchmark to evaluate MLLMs' understanding of different chart types across various comprehension levels. Experimental results show that CHOPINLLM exhibits strong performance in understanding both annotated and unannotated charts across a wide range of types.

8/2/2024

Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Xingchen Zeng, Haichuan Lin, Yilin Ye, Wei Zeng

Emerging multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets (i.e., charts, data tables, and question-answer (QA) pairs) through data collection and synthesis. However, our empirical study on existing MLLMs and CQA datasets reveals notable gaps. First, current data collection and synthesis focus on data volume and lack consideration of fine-grained visual encodings and QA tasks, resulting in unbalanced data distribution divergent from practical CQA scenarios. Second, existing work follows the training recipe of the base MLLMs initially designed for natural images, under-exploring the adaptation to unique chart characteristics, such as rich text elements. To fill the gap, we propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development. Specifically, we propose a novel data engine to effectively filter diverse and high-quality data from existing datasets and subsequently refine and augment the data using LLM-based generation techniques to better align with practical QA tasks and visual encodings. Then, to facilitate the adaptation to chart characteristics, we utilize the enriched data to train an MLLM by unfreezing the vision encoder and incorporating a mixture-of-resolution adaptation strategy for enhanced fine-grained recognition. Experimental results validate the effectiveness of our approach. Even with fewer training examples, our model consistently outperforms state-of-the-art CQA models on established benchmarks. We also contribute a dataset split as a benchmark for future research. Source codes and datasets of this paper are available at https://github.com/zengxingchen/ChartQA-MLLM.

8/13/2024

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu

With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (MMC-Instruction) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (MMC-Benchmark), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC.

4/16/2024

Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs

Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Enamul Hoque

Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts. To facilitate chart-based reasoning using natural language, various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts. These tasks pose a unique challenge, demanding both vision-language reasoning and a nuanced understanding of chart data tables, visual encodings, and natural language prompts. Despite the recent success of Large Language Models (LLMs) across diverse NLP tasks, their abilities and limitations in the realm of data visualization remain under-explored, possibly due to their lack of multi-modal capabilities. To bridge the gap, this paper presents the first comprehensive evaluation of the recently developed large vision language models (LVLMs) for chart understanding and reasoning tasks. Our evaluation includes a comprehensive assessment of LVLMs, including GPT-4V and Gemini, across four major chart reasoning tasks. Furthermore, we perform a qualitative evaluation of LVLMs' performance on a diverse range of charts, aiming to provide a thorough analysis of their strengths and weaknesses. Our findings reveal that LVLMs demonstrate impressive abilities in generating fluent texts covering high-level data insights while also encountering common problems like hallucinations, factual errors, and data bias. We highlight the key strengths and limitations of chart comprehension tasks, offering insights for future research.

6/4/2024