Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Read original: arXiv:2406.09630 - Published 6/17/2024 by Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater


Overview

• This paper introduces the Muharaf dataset, a collection of historic handwritten Arabic manuscripts for cursive text recognition research.
• The dataset contains more than 1,600 page images transcribed by experts in archival Arabic, each annotated with polygonal coordinates for its text lines and basic page elements.
• The documents span diverse handwriting styles and document types, including personal letters, diaries, notes, poems, church records, and legal correspondence.
• Muharaf aims to support the development and evaluation of handwritten text recognition (HTR) systems, both for Arabic manuscripts and for cursive text in general.

Plain English Explanation

The Muharaf dataset is a valuable new resource for researchers working on handwritten Arabic text recognition. It provides a large, diverse collection of handwritten Arabic manuscripts that can be used to train and test AI models for recognizing cursive Arabic script.

Cursive writing, where the letters are connected in a flowing, continuous style, is particularly challenging for optical character recognition (OCR) systems. The Muharaf dataset addresses this with more than 1,600 expert-transcribed page images of historic Arabic manuscripts, covering diverse handwriting styles and document types such as personal letters, diaries, poems, church records, and legal correspondence.

This diversity is important because it allows AI models to be trained on a broad representation of real-world handwritten Arabic, improving their ability to accurately recognize text in different contexts. The Khayyam Offline Persian Handwriting Dataset, MathWriting Dataset for Handwritten Mathematical Expression Recognition, and MDIW-13: A New Multi-Lingual, Multi-Script Dataset are other valuable handwritten text datasets that serve similar purposes for Persian, mathematical, and multi-script recognition, respectively.

By providing this comprehensive dataset, the Muharaf project aims to advance the state of the art in handwritten Arabic text recognition, which has applications in areas like historical document digitization, digital archiving, and automation of Arabic writing-based tasks.

Technical Explanation

The Muharaf dataset consists of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each page image is accompanied by polygonal coordinates for its text lines and for basic page elements. The manuscripts include diverse handwriting styles and a wide range of document types: personal letters, diaries, notes, poems, church records, and legal correspondence.

According to the abstract, the paper describes the data acquisition pipeline along with notable dataset features and statistics. Experts in archival Arabic transcribed the page images, and annotation is at the level of text lines rather than individual words: each text line is marked by a polygon on the page image, alongside polygons for basic page elements.
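As an illustration of how text-line polygon annotations might be consumed, the sketch below cuts a single line out of a page image. It is a minimal example under stated assumptions, not the dataset's own tooling: the polygon is assumed to be a list of (x, y) pixel coordinates, the image a grayscale list of rows, and the names `bounding_box`, `point_in_polygon`, and `crop_line` are hypothetical helpers introduced here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(polygon: List[Point]) -> Tuple[int, int, int, int]:
    """Return the (left, top, right, bottom) integer box enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def crop_line(image: List[List[int]], polygon: List[Point],
              background: int = 255) -> List[List[int]]:
    """Crop the polygon's bounding box and blank out pixels outside the polygon."""
    height = len(image)
    width = len(image[0]) if image else 0
    left, top, right, bottom = bounding_box(polygon)
    # Clamp the box to the image so out-of-range coordinates do not raise.
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, width), min(bottom, height)
    crop = []
    for row in range(top, bottom):
        out_row = []
        for col in range(left, right):
            # Test the pixel centre against the polygon.
            keep = point_in_polygon(col + 0.5, row + 0.5, polygon)
            out_row.append(image[row][col] if keep else background)
        crop.append(out_row)
    return crop
```

Masking pixels outside the polygon, rather than keeping the whole bounding box, keeps strokes from neighbouring lines out of the crop, which matters when lines are slanted or tightly spaced.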

The resulting dataset is divided into training, validation, and test sets to support model development and evaluation. The authors also report a preliminary baseline result, obtained by training convolutional neural networks on this data, which gives later work a reference point for cursive Arabic text recognition.
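Baseline results for handwritten text recognition are commonly reported as character error rate (CER) and word error rate (WER), both derived from the edit distance between the predicted and reference transcriptions. The sketch below shows that computation; which metric the Muharaf baseline reports is not stated in this summary, and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For Arabic text, the result depends on how diacritics and other combining marks are handled, since Python counts each Unicode code point as one character. Applying the same normalization to both strings before scoring is one reasonable choice, but that is an assumption here rather than something the source specifies.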

Two related papers, Arabic Handwritten Text for Person Biometric Identification using Deep Learning and CORU: A Comprehensive Post-OCR Parsing and Receipt Understanding Dataset, describe other datasets that may complement the Muharaf dataset for certain applications.

Critical Analysis

The Muharaf dataset is a significant contribution to the field of handwritten Arabic text recognition. Its diversity of handwriting styles and document types, together with its focus on cursive script, addresses gaps in existing resources. Transcription by experts in archival Arabic supports high-quality ground truth labels, which are essential for training reliable AI models.

However, the paper does not provide detailed information about the demographic or geographic distribution of the manuscript sources, which could be relevant for understanding potential biases in the dataset. Additionally, while the authors mention plans for future dataset expansions, the current version may not capture the full range of handwriting styles and quality levels encountered in real-world applications.

Researchers using the Muharaf dataset should be aware of these potential limitations and consider supplementing it with other relevant datasets, such as those mentioned in the technical explanation, to develop and evaluate their systems as comprehensively as possible.

Conclusion

The Muharaf dataset represents a valuable new resource for advancing handwritten Arabic text recognition research. By providing a large, diverse collection of cursive Arabic manuscript data, the dataset enables the development and evaluation of more robust and accurate AI models for tasks such as historical document digitization, digital archiving, and automation of Arabic writing-based workflows.

The dataset's comprehensive coverage of writing styles, time periods, and quality levels is a key strength, as it helps ensure that AI systems trained on Muharaf can generalize to a wide range of real-world handwritten Arabic text. As the field of Arabic handwriting recognition continues to evolve, the Muharaf dataset is poised to play a crucial role in driving progress and expanding the capabilities of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater

We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.

6/17/2024


Khayyam Offline Persian Handwriting Dataset

Pourya Jafarzadeh, Padideh Choobdar, Vahid Mohammadi Safarzadeh

Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples which are rare in the currently available datasets. Khayyam's dataset contains 44000 words, 60000 letters, and 6000 digits. Moreover, the forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digits, letters, and word data and results are reported. This dataset is available for research and academic use.

6/4/2024

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Ameer Majeed, Hossein Hassani

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively, and both a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as good as the default Syriac model of Tesseract.

8/27/2024


An End-to-End, Segmentation-Free, Arabic Handwritten Recognition Model on KHATT

Sondos Aabed, Ahmad Khairaldin

An end-to-end, segmentation-free deep learning model trained from scratch is proposed, using a DCNN for feature extraction, a Bidirectional Long Short-Term Memory (BLSTM) network for sequence recognition, and the Connectionist Temporal Classification (CTC) loss function, on the KHATT database. Training yields an 84% recognition rate on the test dataset at the character level and 71% at the word level, establishing an image-based sequence recognition framework that operates at the line level without further segmentation. The analysis and preprocessing of the KFUPM Handwritten Arabic TexT (KHATT) database are also presented. Finally, advanced image processing techniques, including filtering, transformation, and line segmentation, are implemented. The importance of this work is highlighted by its wide-ranging applications, including digitizing, documentation, archiving, and text translation in fields such as banking. Moreover, Arabic handwritten recognition (AHR) serves as a pivotal tool for making images searchable, enhancing information retrieval capabilities, and enabling effortless editing. This functionality significantly reduces the time and effort required for tasks such as Arabic data organization and manipulation.

6/24/2024
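The CTC output stage mentioned in the KHATT abstract above can be illustrated with greedy (best-path) decoding: take the highest-scoring symbol in each frame, merge consecutive repeats, then drop the blank symbol. This is a minimal sketch of the general technique, not that paper's implementation; the function name, the toy alphabet, and the convention that index 0 is the blank are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      alphabet: List[str],
                      blank: int = 0) -> str:
    """Best-path CTC decoding over per-frame class scores."""
    # 1. Pick the most likely class index in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]
    # 2. Collapse runs of the same index into a single occurrence.
    collapsed = [index for index, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal symbols keeps both of them.
    return "".join(alphabet[index] for index in collapsed if index != blank)
```

Because the blank separates repeated characters, a frame sequence such as a, a, blank, a, b, b, blank decodes to "aab" rather than "ab". This is what lets a line-level recognizer emit doubled letters without any explicit character segmentation.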