MathWriting: A Dataset For Handwritten Mathematical Expression Recognition

Read original: arXiv:2404.10690 - Published 4/17/2024 by Philippe Gervais, Asya Fadeeva, Andrii Maksai

Overview

This paper introduces a new dataset called MathWriting for handwritten mathematical expression recognition.
The dataset consists of 230k human-written samples and an additional 400k synthetic ones, which the authors describe as the largest online handwritten mathematical expression dataset to date.
Its expressions cover a wide range of mathematical concepts and difficulty levels, which makes the dataset a broad benchmark for evaluating mathematical expression recognition systems.

Plain English Explanation

The researchers who created this dataset recognized that being able to accurately read and understand handwritten mathematical expressions is an important task, with applications in educational technology, document digitization, and more. However, there wasn't a large, diverse dataset available for training and evaluating models for this task.

To address this, they created the MathWriting dataset, which contains 230k human-written math expressions plus an additional 400k synthetic ones. These expressions cover a wide variety of mathematical concepts, from basic arithmetic to more advanced topics like calculus and linear algebra. The expressions also vary in difficulty, with some being quite simple and others being more complex.

By making this dataset publicly available, the researchers aim to provide a comprehensive benchmark that can be used to develop and evaluate new techniques for handwritten mathematical expression recognition. This could lead to improved educational technology that can better understand and analyze students' handwritten work, as well as more accurate document digitization systems.

Technical Explanation

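MathWriting is an online handwritten mathematical expression (HME) dataset. In handwriting recognition, "online" means the input is recorded as pen strokes, while "offline" means the input is a static image. According to the authors, MathWriting can also be used for offline HME recognition, and it is larger than existing offline HME datasets such as IM2LATEX-100K.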

The dataset is divided into training, validation, and test sets. In total it contains 230k human-written samples and an additional 400k synthetic ones. Each expression is represented as a sequence of strokes, with associated bounding box and class label information. The expressions cover a wide range of mathematical concepts, including arithmetic, algebra, trigonometry, calculus, and more.
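To make the stroke representation concrete, here is a minimal Python sketch of how one sample could be held in memory. The `Ink` class, its field names, and the `normalize` helper are hypothetical and are not taken from the dataset's own tooling; the sketch also assumes the label is a single string such as a LaTeX expression.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pen position; hypothetical layout


@dataclass
class Ink:
    """One handwritten expression: its strokes plus a ground-truth label."""
    strokes: List[List[Point]]  # each stroke is an ordered list of points
    label: str                  # assumed to be a string, e.g. LaTeX

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Return (min_x, min_y, max_x, max_y) over all points."""
        xs = [x for stroke in self.strokes for x, _ in stroke]
        ys = [y for stroke in self.strokes for _, y in stroke]
        if not xs:
            raise ValueError("ink has no points")
        return min(xs), min(ys), max(xs), max(ys)


def normalize(ink: Ink) -> Ink:
    """Translate and uniformly scale the ink to fit in the unit square."""
    min_x, min_y, max_x, max_y = ink.bounding_box()
    # Use the longer side so the aspect ratio is preserved.
    scale = max(max_x - min_x, max_y - min_y) or 1.0  # avoid dividing by 0
    strokes = [
        [((x - min_x) / scale, (y - min_y) / scale) for x, y in stroke]
        for stroke in ink.strokes
    ]
    return Ink(strokes=strokes, label=ink.label)
```

For example, a "+" drawn as two strokes would be `Ink(strokes=[[(10.0, 20.0), (30.0, 20.0)], [(20.0, 10.0), (20.0, 30.0)]], label="+")`. Normalizing to a fixed coordinate range is a common preprocessing step before feeding strokes to a recognizer, although the summary does not say whether the authors do this.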

The researchers provide baseline results using state-of-the-art handwritten mathematical expression recognition models, demonstrating the challenges and opportunities presented by the dataset. They also discuss the potential for the dataset to drive further advancements in mathematical language processing and educational technology.
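The summary does not say which metric the baselines are scored with. As an illustration only, the sketch below computes an edit-distance-based error rate, a common way to compare a predicted label against a reference label in recognition tasks. The function names are hypothetical, and whether to compare characters or LaTeX tokens is left to the caller.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference symbol
                curr[j - 1] + 1,     # insert a hypothesis symbol
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance divided by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Calling `error_rate("x^2+1", "x^2-1")` compares the two strings character by character: one substitution over five reference characters. Passing lists of LaTeX tokens instead of raw strings would avoid penalizing a multi-character command such as `\frac` several times for a single mistake.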

Critical Analysis

The MathWriting dataset is a valuable contribution to the field of mathematical expression recognition, providing a comprehensive and diverse benchmark for evaluating models. The inclusion of expressions covering a wide range of mathematical concepts and difficulty levels is particularly noteworthy, as it reflects the real-world challenges faced in educational and document digitization settings.

That said, the paper does not address certain limitations of the dataset. For example, it is unclear how the expressions were sampled from the original sources, and whether the distribution of concepts and difficulty levels is representative of real-world usage. Additionally, the paper does not provide a detailed analysis of the types of errors made by the baseline models, which could inform future research directions.

Further, the paper does not discuss the potential ethical implications of using such a dataset, such as concerns around bias or privacy. As ChatGLM-Math and other advanced mathematical language processing systems become more prevalent, it will be important to carefully consider these issues.

Conclusion

The MathWriting dataset represents a significant step forward in the field of handwritten mathematical expression recognition. By providing a large, diverse, and comprehensive benchmark, the researchers have created a valuable resource that can drive further advancements in this important area of research.

The potential applications of this work are wide-ranging, from improved educational technology to more accurate document digitization. As the field continues to evolve, it will be important to address the limitations and ethical considerations raised in this paper, ensuring that the benefits of this technology are realized in a responsible and equitable manner.

Related Papers

MathWriting: A Dataset For Handwritten Mathematical Expression Recognition

Philippe Gervais, Asya Fadeeva, Andrii Maksai

We introduce MathWriting, the largest online handwritten mathematical expression dataset to date. It consists of 230k human-written samples and an additional 400k synthetic ones. MathWriting can also be used for offline HME recognition and is larger than all existing offline HME datasets like IM2LATEX-100K. We introduce a benchmark based on MathWriting data in order to advance research on both online and offline HME recognition.

4/17/2024

MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition

Felix M. Schmitt-Koopmann, Elaine M. Huang, Hans-Peter Hutter, Thilo Stadelmann, Alireza Darvishy

Printed mathematical expression recognition (MER) models are usually trained and tested using LaTeX-generated mathematical expressions (MEs) as input and the LaTeX source code as ground truth. As the same ME can be generated by various different LaTeX source codes, this leads to unwanted variations in the ground truth data that bias test performance results and hinder efficient learning. In addition, the use of only one font to generate the MEs heavily limits the generalization of the reported results to realistic scenarios. We propose a data-centric approach to overcome this problem, and present convincing experimental results: Our main contribution is an enhanced LaTeX normalization to map any LaTeX ME to a canonical form. Based on this process, we developed an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one. Second, we introduce the real-world dataset realFormula, with MEs extracted from papers. Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets (im2latex-100k, im2latexv2, realFormula, and InftyMDB-1), outperforming the previous state of the art by up to 88.3%.

4/23/2024

Khayyam Offline Persian Handwriting Dataset

Pourya Jafarzadeh, Padideh Choobdar, Vahid Mohammadi Safarzadeh

Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples which are rare in the currently available datasets. Khayyam's dataset contains 44000 words, 60000 letters, and 6000 digits. Moreover, the forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digits, letters, and word data and results are reported. This dataset is available for research and academic use.

6/4/2024

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater

We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.

6/17/2024