Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Read original: arXiv:2408.13631 - Published 8/27/2024 by Ameer Majeed, Hossein Hassani

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Overview

Developing handwritten optical character recognition (OCR) for the East Syriac script, an ancient writing system.
Creating the KHAMIS dataset, a collection of handwritten East Syriac text samples, to enable this OCR system.
Exploring the unique challenges of working with this understudied script compared to more common writing systems.

Plain English Explanation

The paper discusses the process of creating an optical character recognition (OCR) system for the East Syriac script, an ancient writing system used in the Middle East. The researchers developed the KHAMIS dataset, a collection of handwritten East Syriac text samples, to serve as the foundation for this OCR system.

The East Syriac script presents unique challenges compared to more widely studied writing systems like Latin or Arabic. For example, the script has fewer distinct characters, but each character can have multiple variations depending on its placement within a word. This makes it harder for an OCR system to accurately recognize the characters.

By creating the KHAMIS dataset, the researchers have provided a valuable resource for researchers and developers interested in working with the East Syriac script. The dataset can be used to train and test OCR models, ultimately helping to make this ancient written language more accessible in the digital age.

Technical Explanation

The paper describes the process of developing an optical character recognition (OCR) system for the East Syriac script, an understudied writing system used in the Middle East. To enable this OCR system, the researchers created the KHAMIS dataset, a collection of handwritten East Syriac text samples.

The East Syriac script presents unique challenges compared to more common writing systems like Latin or Arabic. For example, while it has fewer distinct characters, each character can have multiple variations depending on its placement within a word. This makes it more difficult for an OCR system to accurately recognize the characters.

To address these challenges, the researchers designed a dataset collection and annotation process tailored to the specific needs of the East Syriac script. They recruited native speakers to write text samples, which were then carefully labeled and prepared for use in training OCR models.

The paper also discusses the potential applications of this research, such as making old Kurdish publications processable and improving the accessibility of historic Syriac manuscripts.

Critical Analysis

The paper provides a valuable contribution to the field of OCR by addressing an understudied writing system. The researchers have demonstrated a thoughtful and systematic approach to dataset creation, which is essential for developing effective OCR models.

However, the paper does not discuss potential limitations or caveats of the research. For example, it is unclear how the OCR models would perform on a more diverse range of handwriting styles or on degraded historical documents. Additionally, the paper does not mention any plans for further expanding or refining the KHAMIS dataset.

Future research could explore ways to improve the segmentation and recognition of cursive Syriac text, or investigate the application of more advanced deep learning techniques, such as those used for Persian handwriting recognition.

Conclusion

This paper presents a significant step forward in the development of handwritten optical character recognition for the East Syriac script. By creating the KHAMIS dataset, the researchers have provided a valuable resource for researchers and developers working with this understudied writing system.

The unique challenges of the East Syriac script, such as the multiple variations of each character, highlight the importance of tailoring OCR approaches to the specific needs of individual writing systems. The insights and methodologies developed in this research could inform similar efforts to digitize and process other endangered or historical scripts, ultimately helping to preserve and make accessible the world's diverse written heritage.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Ameer Majeed, Hossein Hassani

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing a optical character recognition (OCR) model based on the handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on both training and evaluation sets, respectively, and both a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as better than the default Syriac model of Tesseract.

8/27/2024

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Blnd Yaseen, Hossein Hassani

Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Having a good Optical Character Recognition (OCR) to help process these publications and contribute to the Kurdish languages resources which is crucial as Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents as they have many issues, including being damaged, very fragile, having many marks left on them, and often written in non-standard fonts and more. This is a massive obstacle in processing these documents as currently processing them requires manual typing which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Currently, there is no public dataset, and we developed our own by collecting historical documents from Zheen Center for Documentation and Research, which were printed before 1950 and resulted in a dataset of 1233 images of lines with transcription of each. Then we used the Arabic model as our base model and trained the model using the dataset. We used different methods to evaluate our model, Tesseracts built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%. Additionally, Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application to provide an easy- to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. Having an extensive dataset is crucial to develop OCR systems with reasonable accuracy, as currently, no public datasets are available for historical Kurdish documents; this posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved another challenge with our work.

4/10/2024

🖼️

Image Based Character Recognition, Documentation System To Decode Inscription From Temple

Velmathi G, Shangavelan M, Harish D, Krithikshun M S

This project undertakes the training and analysis of optical character recognition OCR methods applied to 10th century ancient Tamil inscriptions discovered on the walls of the Brihadeeswarar Temple.The chosen OCR methods include Tesseract,a widely used OCR engine,using modern ICR techniques to pre process the raw data and a box editing software to finetune our model.The analysis with Tesseract aims to evaluate their effectiveness in accurately deciphering the nuances of the ancient Tamil characters.The performance of our model for the dataset are determined by their accuracy rate where the evaluated dataset divided into training set and testing set.By addressing the unique challenges posed by the script's historical context,this study seeks to contribute valuable insights to the broader field of OCR,facilitating improved preservation and interpretation of ancient inscriptions

5/29/2024

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater

We present the Manuscripts of Handwritten Arabic~(Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.

6/17/2024