Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Read original: arXiv:2404.06101 - Published 4/10/2024 by Blnd Yaseen, Hossein Hassani

Overview

  • This paper focuses on making old Kurdish publications processable by improving optical character recognition (OCR) engines.
  • The authors discuss the history of the printing press in Iraq and Iraqi Kurdistan, and the challenges of processing old Kurdish texts using existing OCR tools.
  • The paper proposes an approach to augment available OCR engines to better handle the unique characteristics of old Kurdish publications.

Plain English Explanation

The paper is about a problem faced when trying to digitize and process old Kurdish publications using existing optical character recognition (OCR) technology. OCR converts text in images or scanned documents into machine-readable characters.

However, the authors explain that current OCR systems struggle with historical Kurdish publications. The documents are often damaged, very fragile, covered in marks, and printed in non-standard fonts. As a result, processing them currently requires manual typing, which is very time-consuming.

To address this issue, the researchers enhance an existing OCR engine rather than build a new one. They start from the Arabic model of Tesseract, an open-source OCR framework, and train it further on transcribed line images taken from old Kurdish publications. The aim is to make it easier to convert these historical publications into a digital, machine-readable format.

This is an important problem to solve because it would allow researchers, historians, and the general public to more easily access and analyze these valuable Kurdish literary and cultural resources. By making old Kurdish publications more "processable" (or easier for computers to understand), the authors hope to preserve and unlock the knowledge contained within these texts.

Technical Explanation

The paper begins by providing historical context on the development of the printing press in Iraq and the Iraqi Kurdistan region. It highlights the challenges of processing old Kurdish publications using conventional OCR tools, which are often optimized for Latin-based scripts.

To address this problem, the researchers augment an available OCR engine. According to the abstract, they adopt Tesseract version 5.0, an open-source OCR framework by Google, take its Arabic model as the base model, and train it on a dataset they built themselves because no public dataset exists. The dataset contains 1233 line images, each with a transcription, collected from documents printed before 1950 and held at the Zheen Center for Documentation and Research.
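The abstract names Tesseract 5.0 as the engine and its Arabic model as the starting point. As a rough illustration of how a fine-tuned model might then be applied to a single line image, the sketch below builds a Tesseract command line from Python. The model name `kurdish_old`, the file names, and both helper functions are hypothetical and do not come from the paper; the flags `-l`, `--psm`, and `--tessdata-dir` are standard Tesseract command-line options, and page segmentation mode 7 treats the image as a single text line.

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    model_name: str,
    tessdata_dir: Optional[str] = None,
    psm: int = 7,
) -> List[str]:
    """Build the argument list for one Tesseract run that prints text to stdout.

    model_name is the name of a .traineddata file without its extension,
    e.g. "ara" for the stock Arabic model or a hypothetical fine-tuned model.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", model_name, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Directory that holds the .traineddata file for model_name.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def recognize_line(
    image_path: str, model_name: str, tessdata_dir: Optional[str] = None
) -> str:
    """Run Tesseract on one line image and return the recognized text.

    Requires the tesseract executable to be installed and on PATH.
    """
    cmd = build_tesseract_command(image_path, model_name, tessdata_dir)
    result = subprocess.run(cmd, capture_output=True, encoding="utf-8", check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical model and file names, shown for illustration only.
    print(build_tesseract_command("line_0001.png", "kurdish_old", "models"))
```

Keeping command construction separate from execution makes it possible to inspect or log the exact call before running it on a large batch of scanned lines.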

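The paper also evaluates the trained model with two tools. Tesseract's built-in evaluator, lstmeval, indicates a Character Error Rate (CER) of 0.755%, while Ocreval reports an average character accuracy of 84.02%. Measuring recognition quality matters because errors in the OCR output carry over into downstream applications and research.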
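Character Error Rate, the metric named in the paper's abstract, is commonly defined as the edit distance between the OCR output and the reference transcription, divided by the length of the reference. The sketch below is a minimal, self-contained version of that common definition; it is not the implementation used by lstmeval or Ocreval, which may count errors differently. The sample strings are illustrative and not taken from the paper's dataset.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete ref_char
                    current[j - 1] + 1,                   # insert hyp_char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative only: a five-letter word ending in Farsi yeh (U+06CC),
    # and a hypothetical OCR output that ends in Arabic yeh (U+064A) instead.
    reference = "\u06a9\u0648\u0631\u062f\u06cc"
    hypothesis = "\u06a9\u0648\u0631\u062f\u064a"
    print(character_error_rate(reference, hypothesis))
```

Because the comparison is made code point by code point, visually similar letters with different Unicode values count as errors. Whether to normalize such characters before scoring is a design choice that changes the reported rate.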

Critical Analysis

The paper highlights a valuable and underexplored problem in the field of digital preservation and text processing. The authors recognize the unique challenges posed by old Kurdish publications and propose a thoughtful approach to address them.

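However, the evaluation figures reported in the abstract are hard to compare. Tesseract's lstmeval indicates a Character Error Rate of 0.755%, while Ocreval reports an average character accuracy of 84.02%, and the abstract does not say which data each figure was measured on. The authors also note that the lack of an extensive public dataset posed a significant challenge. More empirical results and analysis would be helpful to fully assess the effectiveness and limitations of the proposed solution.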

Additionally, the paper does not discuss potential biases or errors that could be introduced by the modified OCR engines. It would be important to consider how these issues could affect the reliability and trustworthiness of the digitized Kurdish texts, especially for historical and cultural research.

Further research could also explore the application of scene text recognition techniques to handle the diverse typography and layout characteristics often found in old publications.

Conclusion

This paper identifies an important problem in preserving and processing historical Kurdish publications using optical character recognition technology. The authors propose a solution to augment existing OCR engines to better handle the unique features of old Kurdish texts.

While the paper provides valuable context and a high-level approach, more technical details and empirical evaluation would be needed to fully assess the feasibility and impact of the proposed method. Nonetheless, this research highlights the importance of developing language-specific text processing solutions to ensure the preservation and accessibility of cultural heritage materials.




Related Papers

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Blnd Yaseen, Hossein Hassani

Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Having a good Optical Character Recognition (OCR) system to help process these publications and contribute to Kurdish language resources is crucial, as Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents as they have many issues, including being damaged, very fragile, having many marks left on them, and often being written in non-standard fonts. This is a massive obstacle in processing these documents, as processing them currently requires manual typing, which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Currently, there is no public dataset, so we developed our own by collecting historical documents from the Zheen Center for Documentation and Research, which were printed before 1950, resulting in a dataset of 1233 images of lines with a transcription of each. Then we used the Arabic model as our base model and trained the model using the dataset. We used different methods to evaluate our model: Tesseract's built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%. Additionally, Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application to provide an easy-to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. Having an extensive dataset is crucial to developing OCR systems with reasonable accuracy; as no public datasets are currently available for historical Kurdish documents, this posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved another challenge in our work.

4/10/2024

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Ameer Majeed, Hossein Hassani

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively, and a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as good as the default Syriac model of Tesseract.

8/27/2024

Efficient OCR for Building a Diverse Digital History

Jacob Carlson, Tom Bryan, Melissa Dell

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) - which jointly learns a vision and language model - is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters' visual features, it is more sample efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.

7/29/2024

Post-OCR Text Correction for Bulgarian Historical Documents

Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov

The digitization of historical documents is crucial for preserving the cultural heritage of the society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem as standard OCR tools are not tailored to deal with historical orthography as well as with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary literature Bulgarian texts. We then use state-of-the-art LLMs and encoder-decoder framework which we augment with diagonal attention loss and copy and coverage mechanisms to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.

9/4/2024