CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

Read original: arXiv:2409.02834 - Published 9/9/2024 by Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He

Overview

CMM-Math is a new Chinese multimodal math dataset designed to evaluate and enhance the mathematics reasoning capabilities of large multimodal models.
The dataset contains over 28,000 math problems across 12 grade levels, from elementary to high school, in multiple problem types (e.g., multiple-choice, fill-in-the-blank) and modalities (e.g., text, images, formulas).
Researchers created CMM-Math to address the lack of comprehensive multimodal math benchmarks, especially for non-English languages.

Plain English Explanation

The CMM-Math dataset is a collection of math problems that combine text, images, and mathematical expressions. It was developed to help evaluate and improve the ability of large AI models to reason about and solve math problems, especially for the Chinese language.

Current math datasets tend to be limited in scope or focused on a single modality, such as text-only math word problems, and the few multimodal math benchmarks that exist are mostly in English. CMM-Math aims to be more comprehensive, with over 28,000 problems that cover 12 grade levels and a range of problem types. This allows researchers to test how well AI models can understand and reason about math concepts expressed through different mediums, like written explanations, diagrams, and formulas.

By creating this multimodal Chinese math dataset, the researchers hope to drive progress in building AI systems that can truly comprehend and apply mathematical reasoning, not just perform basic calculations. Improving AI's math skills could lead to better educational tools, more capable virtual assistants, and advanced problem-solving capabilities across many domains.

Technical Explanation

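The CMM-Math dataset consists of over 28,000 math problems in Chinese, with detailed solutions, covering 12 grade levels from elementary to high school. Problems come in several types (e.g., multiple-choice and fill-in-the-blank) and combine textual descriptions with images or diagrams; the visual context may appear in the question itself or in the answer options, which makes the dataset more challenging.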

The researchers sourced the problems from Chinese online education platforms and carefully curated and annotated the dataset to ensure high quality and diversity. The problems span arithmetic, algebra, geometry, probability, and other math concepts, with varying levels of complexity.

To establish benchmark performance, the researchers evaluated several state-of-the-art multimodal and math-focused AI models on the CMM-Math dataset. The results showed that while these models performed reasonably well on certain subsets of the problems, there is still significant room for improvement, especially on more challenging math reasoning tasks.
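Per-subset results of this kind are usually obtained by grouping graded predictions and computing accuracy within each group. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the record fields (`grade`, `type`, `answer`, `prediction`), the exact-match scoring rule, and the toy records are all assumptions made for illustration and do not reflect the real CMM-Math schema.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group_value: accuracy} using exact-match scoring.

    `records` is a list of dicts; `key` names the field to group by.
    All field names here are hypothetical, not the CMM-Math schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # Exact match after trimming whitespace (an assumed scoring rule).
        if record["prediction"].strip() == record["answer"].strip():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records standing in for model outputs on benchmark problems.
records = [
    {"grade": 3, "type": "multiple-choice", "answer": "B", "prediction": "B"},
    {"grade": 3, "type": "fill-in-the-blank", "answer": "12", "prediction": "15"},
    {"grade": 9, "type": "multiple-choice", "answer": "A", "prediction": "C"},
    {"grade": 9, "type": "multiple-choice", "answer": "D", "prediction": "D "},
]

by_grade = accuracy_by_group(records, "grade")
by_type = accuracy_by_group(records, "type")
print(by_grade)
print(by_type)
```

Grouping by grade level or problem type in this way makes it easy to see where a model is weakest. Exact match suits multiple-choice answers; free-form answers would likely need a more forgiving comparison.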

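The creation of CMM-Math aims to drive progress in multimodal mathematical reasoning by providing both a benchmark and training data. By testing AI models on the benchmark, researchers can identify strengths, weaknesses, and areas for further development. The paper also proposes a Multimodal Mathematical LMM (Math-LMM) that handles mixed input of multiple images and text segments, trained in three stages: foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. According to the authors, Math-LMM improves math reasoning performance compared with state-of-the-art LMMs on three multimodal mathematical datasets.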

Critical Analysis

The CMM-Math dataset represents an important step forward in multimodal math benchmarking, particularly for the Chinese language. However, several limitations and areas for further research remain:

The dataset is focused on the Chinese language, which may limit its applicability to other languages and cultural contexts. Expanding the dataset to include more diverse linguistic and cultural perspectives could be valuable.
The curation and annotation process, while thorough, may introduce some biases or inconsistencies that could affect model performance. Further research on the dataset's quality and potential biases would be beneficial.
The baseline model evaluations provide a good starting point, but more in-depth analysis of model strengths, weaknesses, and failure modes could yield additional insights to guide future research.
Exploring the transferability of models trained on CMM-Math to other math datasets or real-world applications would be an important next step.

Overall, the CMM-Math dataset represents a significant contribution to the field of multimodal math reasoning, but continued research and refinement will be necessary to fully realize its potential in advancing the state of the art.

Conclusion

The CMM-Math dataset is a valuable new resource for evaluating and enhancing the mathematical reasoning capabilities of large multimodal AI models, particularly in the context of the Chinese language. By providing a comprehensive benchmark that combines text, images, and mathematical expressions, the dataset aims to drive progress in building AI systems that can truly understand and apply complex mathematical concepts.

While current models leave clear room for improvement on the benchmark, continued refinement of the dataset and further study of model performance and transferability could advance AI-powered mathematical reasoning, with potential applications in education, virtual assistants, and beyond.

Related Papers

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He

Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or options, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.

9/9/2024

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Zhi-Long Ji, Jin-Feng Bai, Zhen-Ru Pan, Fan-Hu Zeng, Jian Xu, Jia-Xin Zhang, Cheng-Lin Liu

Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Although datasets like MathVista provide benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 education in the Chinese language. To systematically evaluate the capability of multimodal large models in solving Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, containing 23k multimodal K12 math-related questions, forming the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH questions, from elementary to high school levels, provide increased diversity in problem types, solution objectives, visual elements, detailed knowledge points, and standard solution annotations. We have constructed an open-source tool GradeGPT integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation. Our data and code are available.

7/18/2024

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Jie Fu

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.

9/10/2024

Advancing Geometric Problem Solving: A Comprehensive Benchmark for Multimodal Model Evaluation

Kai Sun, Yushi Bai, Ji Qi, Lei Hou, Juanzi Li

To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points. Unlike existing benchmarks relying on binary answer comparison, MM-MATH incorporates both outcome and process evaluations. Process evaluation employs LMM-as-a-judge to automatically analyze solution steps, identifying and categorizing errors into specific error types. Extensive evaluation of ten models on MM-MATH reveals significant challenges for existing LMMs, highlighting their limited utilization of visual information and struggles with higher-difficulty problems. The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans. This highlights the challenging nature of our benchmark for existing models and the significant gap between the multimodal reasoning capabilities of current models and humans. Our process evaluation reveals that diagram misinterpretation is the most common error, accounting for more than half of the total error cases, underscoring the need for improved image comprehension in multimodal reasoning.

6/28/2024