CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Read original: arXiv:2407.12023 - Published 7/18/2024 by Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Zhi-Long Ji, Jin-Feng Bai, Zhen-Ru Pan, Fan-Hu Zeng, Jian Xu, Jia-Xin Zhang, Cheng-Lin Liu

Overview

The paper introduces CMMaTH, a Chinese multi-modal math skill evaluation benchmark for foundation models, containing 23k K12 math questions.
CMMaTH aims to assess the math capabilities of large language models and multi-modal models by testing them on a variety of math-related tasks.
The benchmark includes both visual and textual math problems, covering areas like algebra, geometry, and word problems.
The goal is to provide a comprehensive evaluation of a model's math reasoning and problem-solving abilities in a real-world, multi-modal setting.

Plain English Explanation

The CMMaTH benchmark is a new tool designed to test the math skills of AI language models and multi-modal models. Multi-modal models process both text and visual information, so CMMaTH evaluates their ability to solve math problems that combine text and images.

The benchmark includes a wide range of math problems, covering topics like algebra, geometry, and word problems. This allows the researchers to get a comprehensive understanding of the models' math reasoning and problem-solving capabilities. The problems are written in Chinese, reflecting the real-world use case of these models in a Chinese-language context.

By creating this multi-modal math evaluation tool, the researchers aim to advance the development of AI systems that can truly understand and apply mathematical concepts, rather than just memorizing and reciting formulas. This is an important step towards building AI assistants that can help humans with complex, real-world math tasks.

Technical Explanation

The CMMaTH benchmark was developed to evaluate the math skills of large language models and multi-modal AI systems in a Chinese-language context. It consists of a diverse set of math problems that combine textual descriptions and visual elements, such as diagrams and equations.

The benchmark covers a wide range of math topics, including algebra, geometry, and word problems. This allows for a comprehensive assessment of the models' mathematical reasoning and problem-solving abilities. The problems are designed to be challenging, requiring the models to understand the underlying concepts and perform step-by-step reasoning to arrive at the correct solutions.

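To create the CMMaTH dataset, the researchers collected 23k K12 math problems, spanning elementary to high school levels, from various Chinese educational resources and carefully curated and annotated them. They also developed a set of evaluation metrics to measure the models' performance on different aspects of the math problems, such as answer accuracy, step-by-step reasoning, and multi-modal understanding. According to the abstract, the authors additionally built GradeGPT, an open-source tool integrated with the dataset, intended to make model evaluation stable, rapid, and cost-free.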
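As a rough illustration of what a per-category answer-accuracy metric could look like, the sketch below groups evaluation records by school level and computes the fraction answered correctly. The record fields (`level`, `answer`, `prediction`) and the exact-match comparison are assumptions made for this example; they are not the CMMaTH data schema, and this is not how GradeGPT scores answers.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Remove all whitespace and lowercase, so ' B ' and 'b' compare equal."""
    return "".join(answer.split()).lower()


def accuracy_by_group(records, key):
    """Return {group: fraction of correct predictions} for records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        # Exact match after normalization: a simplification chosen for this sketch.
        if normalize(rec["prediction"]) == normalize(rec["answer"]):
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records; the field names are illustrative only.
records = [
    {"level": "elementary", "answer": "12", "prediction": " 12 "},
    {"level": "elementary", "answer": "B", "prediction": "c"},
    {"level": "high", "answer": "x=3", "prediction": "x = 3"},
]

scores = accuracy_by_group(records, "level")
print(scores)
```

Passing a different key, such as a hypothetical `knowledge_point` field, would give the kind of fine-grained breakdown the benchmark is aiming for. Exact matching is brittle for free-form math answers, which is one reason a dedicated grading tool may be preferable.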

The introduction of the CMMaTH benchmark is an important advancement in the field of multi-modal math evaluation, as it provides a more realistic and comprehensive assessment of a model's math capabilities compared to traditional, text-only math benchmarks. By incorporating visual elements and real-world problem scenarios, CMMaTH aims to push the boundaries of multi-modal model evaluation and contribute to the development of trustworthy and efficient multi-modal AI systems.

Critical Analysis

The CMMaTH benchmark is a valuable contribution to the field of multi-modal AI evaluation, as it addresses the need for more comprehensive and realistic math assessment tools. By incorporating visual elements and a wider range of math concepts, the benchmark provides a more holistic evaluation of a model's mathematical understanding and problem-solving abilities.

However, the paper acknowledges that the current version of CMMaTH is limited to the Chinese language and may not fully capture the diversity of math problem-solving strategies used in different cultural and educational contexts. Expanding the benchmark to include problems in other languages and incorporating more diverse problem-solving approaches could further enhance its usefulness and applicability.

Additionally, the paper does not provide a detailed analysis of the performance of existing large language models and multi-modal models on the CMMaTH benchmark. Incorporating such an analysis could help researchers and developers better understand the current state of the art in multi-modal math AI and identify areas for improvement.

Conclusion

The CMMaTH benchmark represents a significant step forward in evaluating the mathematical capabilities of large language models and multi-modal AI systems. By incorporating both textual and visual elements, the benchmark offers a more realistic and comprehensive assessment of a model's ability to understand and apply mathematical concepts in real-world scenarios.

The introduction of CMMaTH has the potential to drive progress in the development of AI systems that can effectively assist humans with complex, multi-modal math tasks, such as solving word problems, interpreting diagrams, and applying mathematical reasoning in various domains. As the field of multi-modal AI continues to evolve, tools like CMMaTH will play a crucial role in ensuring that these systems are trustworthy, efficient, and able to reliably support human decision-making and problem-solving.

Related Papers

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Zhi-Long Ji, Jin-Feng Bai, Zhen-Ru Pan, Fan-Hu Zeng, Jian Xu, Jia-Xin Zhang, Cheng-Lin Liu

Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Although datasets like MathVista provide benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 education in the Chinese language. To systematically evaluate the capability of multimodal large models in solving Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, containing 23k multimodal K12 math-related questions, forming the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH questions range from elementary to high school levels and provide increased diversity in problem types, solution objectives, visual elements, detailed knowledge points, and standard solution annotations. We have constructed an open-source tool, GradeGPT, integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation. Our data and code are available.

7/18/2024

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He

Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or options, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.

9/9/2024

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Jie Fu

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves an accuracy of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.

9/10/2024

Advancing Geometric Problem Solving: A Comprehensive Benchmark for Multimodal Model Evaluation

Kai Sun, Yushi Bai, Ji Qi, Lei Hou, Juanzi Li

To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points. Unlike existing benchmarks relying on binary answer comparison, MM-MATH incorporates both outcome and process evaluations. Process evaluation employs LMM-as-a-judge to automatically analyze solution steps, identifying and categorizing errors into specific error types. Extensive evaluation of ten models on MM-MATH reveals significant challenges for existing LMMs, highlighting their limited utilization of visual information and struggles with higher-difficulty problems. The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans. This highlights the challenging nature of our benchmark for existing models and the significant gap between the multimodal reasoning capabilities of current models and humans. Our process evaluation reveals that diagram misinterpretation is the most common error, accounting for more than half of the total error cases, underscoring the need for improved image comprehension in multimodal reasoning.

6/28/2024