Khayyam Offline Persian Handwriting Dataset

Read original: arXiv:2406.01025 - Published 6/4/2024 by Pourya Jafarzadeh, Padideh Choobdar, Vahid Mohammadi Safarzadeh

Overview

This paper presents the Khayyam dataset, a large, unconstrained offline handwriting dataset covering elements of the Persian language: words, sentences, letters, and digits.
The dataset contains 44,000 words, 60,000 letters, and 6,000 digits, collected from 400 native Persian writers.
The authors focused on collecting rare Persian word samples not found in other available datasets.
The dataset is intended to support training and evaluation of machine learning algorithms for Persian handwriting recognition.

Plain English Explanation

Handwriting recognition is an important application of machine learning, but it requires comprehensive datasets to train and test the algorithms. The Khayyam dataset is a new large dataset of Persian handwriting that the authors created to address this need.

The dataset contains a variety of handwritten elements - words, letters, and digits - collected from 400 native Persian writers. The authors focused on including rare Persian words that are not well represented in other available datasets. This diversity is important, as machine learning models need to be trained on a wide range of handwriting samples to perform well in real-world applications.

By making this dataset available for research and academic use, the researchers hope to enable further developments in Persian handwriting recognition. Researchers can use the Khayyam dataset to train and test their own machine learning models, advancing the state of the art in this field. This could lead to improved applications like automated document processing and digital note-taking for Persian speakers.

Technical Explanation

The Khayyam dataset contains 44,000 Persian words, 60,000 Persian letters, and 6,000 digits, collected from 400 native Persian writers. The authors position it as another large unconstrained dataset alongside the existing Persian handwriting collections, with a deliberate emphasis on words that those collections rarely cover.
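To put these totals in perspective, the short calculation below works out the average number of samples per writer in each category. It is only arithmetic on the reported figures and assumes, for illustration, that all 400 writers contributed to every category, which the summary does not state. The names `COUNTS`, `NUM_WRITERS`, and `average_per_writer` are made up for this sketch.

```python
# Sample counts and writer count as reported for the Khayyam dataset.
COUNTS = {"words": 44_000, "letters": 60_000, "digits": 6_000}
NUM_WRITERS = 400


def average_per_writer(counts, num_writers):
    """Return the mean number of samples per writer for each category.

    Assumes every writer contributed to every category (not confirmed
    by the source), so these are averages, not guaranteed per-form counts.
    """
    return {name: total / num_writers for name, total in counts.items()}


per_writer = average_per_writer(COUNTS, NUM_WRITERS)
total_samples = sum(COUNTS.values())

for name, mean_count in per_writer.items():
    print(f"{name}: {mean_count:.0f} samples per writer on average")
print(f"total samples: {total_samples}")
```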

To create the dataset, the authors distributed form templates to the participants and asked them to fill out a variety of handwritten elements, including uncommon Persian words that are often missing from other datasets. This approach was designed to provide a more comprehensive and representative sample of Persian handwriting.
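Because each form comes from a single participant, one common way to evaluate models on data like this is to hold out whole writers, so that no writer's handwriting appears in both the training and test sets. The summary does not say how the authors split their data, so the following is a minimal sketch of that general idea. The `(writer_id, label, image)` tuple layout and the function name `split_by_writer` are assumptions for illustration, not the dataset's documented format.

```python
import random


def split_by_writer(samples, test_fraction=0.2, seed=0):
    """Split samples so that no writer appears in both train and test.

    `samples` is a list of (writer_id, label, image) tuples. This layout
    is a hypothetical one chosen for the example.
    """
    writer_ids = sorted({writer_id for writer_id, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(writer_ids)

    # Reserve a fraction of the writers (at least one) for testing.
    num_test = max(1, round(len(writer_ids) * test_fraction))
    test_writers = set(writer_ids[:num_test])

    train = [s for s in samples if s[0] not in test_writers]
    test = [s for s in samples if s[0] in test_writers]
    return train, test


# Toy data: 10 hypothetical writers, each writing the digits 0-2 once.
# Images are left as None because only the split logic matters here.
toy_samples = [(writer, digit, None) for writer in range(10) for digit in range(3)]
train_set, test_set = split_by_writer(toy_samples)
```

Splitting on writer IDs instead of on individual samples gives a stricter estimate of how a model handles handwriting styles it has never seen.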

The authors then used machine learning techniques to train models on the digit, letter, and word data from the Khayyam dataset. They report the results of these experiments, demonstrating the dataset's utility for advancing handwriting recognition research.
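The summary does not name the models the authors trained. As a rough illustration of the train-then-evaluate workflow, the sketch below fits a nearest-centroid classifier, a deliberately simple baseline chosen here and not taken from the paper, on synthetic stand-in data. The toy "images" are random pixel vectors, not Khayyam samples, and all function names are hypothetical.

```python
import random


def make_toy_data(num_classes=3, per_class=30, num_pixels=64, seed=0):
    """Create synthetic flattened 'images': one random prototype per class plus noise."""
    rng = random.Random(seed)
    data = []
    for label in range(num_classes):
        prototype = [rng.random() for _ in range(num_pixels)]
        for _ in range(per_class):
            pixels = [p + rng.gauss(0, 0.1) for p in prototype]
            data.append((pixels, label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Average the pixel vectors of each class to get one centroid per label."""
    sums, counts = {}, {}
    for pixels, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(pixels)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], pixels)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, pixels):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, pixels))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, test):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(predict(centroids, pixels) == label for pixels, label in test)
    return correct / len(test)


data = make_toy_data()
split = int(0.8 * len(data))  # 80% train, 20% test
train, test = data[:split], data[split:]
centroids = fit_centroids(train)
test_accuracy = accuracy(centroids, test)
print(f"test accuracy on toy data: {test_accuracy:.2f}")
```

With real handwriting images, the same three steps (load labelled samples, fit on a training portion, score on held-out data) would apply, though a convolutional network or similar model would normally replace the centroid baseline.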

The Khayyam dataset joins other recent handwriting resources, such as the MDIW-13 dataset for multi-lingual handwriting and the MathWriting dataset for handwritten mathematical expressions. Its focus on Persian handwriting, and on rare Persian words in particular, is what makes it a valuable addition to the available resources for this language.

Critical Analysis

The Khayyam dataset represents a significant contribution to the field of Persian handwriting recognition. By deliberately including rare Persian word samples, the authors have addressed a key limitation of previous datasets, which tended to over-represent more common words.

However, the paper does not provide detailed information about the demographic diversity of the 400 writers whose samples are included in the dataset. It would be helpful to know the age, gender, and geographic distribution of the participants to better understand the dataset's representativeness.

Additionally, the paper does not discuss potential biases or limitations in the data collection process. For example, were the participants self-selected, and if so, how might this have influenced the types of handwriting samples obtained?

Further research could also explore the dataset's suitability for other applications beyond handwriting recognition, such as forensic analysis or historical document processing. Expanding the dataset's utility and exploring its potential limitations would strengthen its value to the research community.

Conclusion

The Khayyam dataset represents a significant advancement in the availability of Persian handwriting data for machine learning research. By providing a large and diverse collection of word, letter, and digit samples that is available for research and academic use, the authors have created a valuable resource for developing and evaluating Persian handwriting recognition algorithms.

This dataset can support a wide range of applications, from automated document processing to digital note-taking, ultimately improving the accessibility and usability of technology for Persian speakers. As the research community continues to explore the dataset's capabilities and limitations, the Khayyam dataset will likely become an increasingly important tool for advancing the field of handwriting recognition.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Khayyam Offline Persian Handwriting Dataset

Pourya Jafarzadeh, Padideh Choobdar, Vahid Mohammadi Safarzadeh

Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples which are rare in the currently available datasets. The Khayyam dataset contains 44,000 words, 60,000 letters, and 6,000 digits. Moreover, the forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digit, letter, and word data and the results are reported. This dataset is available for research and academic use.

6/4/2024

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater

We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.

6/17/2024

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Ameer Majeed, Hossein Hassani

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively, and a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as good as the default Syriac model of Tesseract.

8/27/2024

MathWriting: A Dataset For Handwritten Mathematical Expression Recognition

Philippe Gervais, Asya Fadeeva, Andrii Maksai

We introduce MathWriting, the largest online handwritten mathematical expression dataset to date. It consists of 230k human-written samples and an additional 400k synthetic ones. MathWriting can also be used for offline HME recognition and is larger than all existing offline HME datasets like IM2LATEX-100K. We introduce a benchmark based on MathWriting data in order to advance research on both online and offline HME recognition.

4/17/2024