Post-OCR Text Correction for Bulgarian Historical Documents

Read original: arXiv:2409.00527 - Published 9/4/2024 by Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov

Overview

Digitizing historical documents is crucial for preserving cultural heritage
Optical Character Recognition (OCR) is a key step, but standard tools struggle with historical orthography and document layouts
This work focuses on improving OCR text correction for historical Bulgarian documents

Plain English Explanation

The conversion of scanned historical documents into searchable digital text is an important task for preserving cultural heritage. This process involves using Optical Character Recognition (OCR) to extract text from the scanned images. However, standard OCR tools often struggle with documents that use outdated spelling and formatting, which is common for historical documents.

To address this challenge, researchers in this work focused on improving the text correction step that is typically applied after OCR. They created the first benchmark dataset for evaluating OCR text correction on historical Bulgarian documents written in the 19th century Drinov orthography. Additionally, they developed a method to automatically generate synthetic training data in both the Drinov and the later Ivanchev orthographies by leveraging existing contemporary Bulgarian literature.

The researchers then used state-of-the-art large language models and an encoder-decoder framework, which they enhanced with diagonal attention loss and copy and coverage mechanisms. This approach reduced the errors introduced during OCR and improved the quality of the digitized documents by 25%, an increase of 16% over the previous state of the art on the ICDAR 2019 Bulgarian dataset.

Technical Explanation

The researchers focused on improving the post-OCR text correction process for historical Bulgarian documents. They created the first benchmark dataset for evaluating this task, which includes documents written in the 19th century Drinov orthography, the first standardized Bulgarian writing system.

To expand the training data, the researchers developed a method to automatically generate synthetic data in both the Drinov and Ivanchev orthographies. This was achieved by leveraging large amounts of contemporary Bulgarian literature and applying targeted transformations to simulate the historical spelling and formatting.
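As a rough illustration of what a "targeted transformation" could look like, the Python sketch below applies one well-known feature of older Bulgarian spelling, a silent word-final ъ after consonants, and then injects character confusions to imitate OCR noise. This is a toy under stated assumptions, not the authors' pipeline: the single spelling rule ignores words that historically ended in ь and every other orthographic difference, the confusion table is invented for illustration, and the names `to_historical` and `add_ocr_noise` are hypothetical.

```python
import random
import re

# Lowercase Bulgarian consonant letters that trigger the toy rule.
# "й" is deliberately left out, so words like "край" stay unchanged.
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")

# Hypothetical OCR confusions, invented for illustration only.
CONFUSIONS = {"ъ": "ь", "и": "н", "о": "0", "е": "с"}


def to_historical(text: str) -> str:
    """Append a silent 'ъ' to every word that ends in a lowercase consonant."""
    def add_final_er(match: re.Match) -> str:
        word = match.group(0)
        return word + "ъ" if word[-1] in CONSONANTS else word

    # Match runs of basic Cyrillic letters; everything else is copied as-is.
    return re.sub(r"[а-яА-Я]+", add_final_er, text)


def add_ocr_noise(text: str, rate: float, rng: random.Random) -> str:
    """Replace confusable characters with probability `rate`."""
    noisy = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            noisy.append(CONFUSIONS[ch])
        else:
            noisy.append(ch)
    return "".join(noisy)


# Build one (noisy input, clean target) training pair from modern text.
modern = "в стар град"
target = to_historical(modern)
source = add_ocr_noise(target, rate=0.3, rng=random.Random(0))
```

Pairs such as `(source, target)` are the kind of supervision a correction model needs: the noisy string stands in for OCR output and the clean historical-spelling string is what the model should produce.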

The researchers then employed state-of-the-art large language models and an encoder-decoder framework to perform the text correction. They further augmented this approach with specialized techniques, such as diagonal attention loss and copy and coverage mechanisms, to enhance the model's performance.
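To make the two auxiliary ideas more concrete, the sketch below computes a diagonal attention penalty and a coverage penalty over a plain attention matrix. It rests on assumptions rather than on the paper's exact definitions: the diagonal term simply averages the attention mass that falls outside a fixed band around the diagonal (reasonable when OCR output and corrected text have similar lengths), and the coverage term follows the formulation commonly used with pointer-generator networks. The function names and the `band` parameter are hypothetical.

```python
from typing import List

Attention = List[List[float]]  # attn[t][i]: weight of output step t on input position i


def diagonal_attention_loss(attn: Attention, band: int = 1) -> float:
    """Average attention mass per output step that lies outside |t - i| <= band."""
    off_diagonal = 0.0
    for t, row in enumerate(attn):
        off_diagonal += sum(w for i, w in enumerate(row) if abs(t - i) > band)
    return off_diagonal / len(attn)


def coverage_loss(attn: Attention) -> float:
    """Sum over steps of sum_i min(a_t[i], c_t[i]), where c_t is the
    attention accumulated over all previous steps."""
    coverage = [0.0] * len(attn[0])
    penalty = 0.0
    for row in attn:
        penalty += sum(min(a, c) for a, c in zip(row, coverage))
        coverage = [c + a for c, a in zip(coverage, row)]
    return penalty


# A perfectly monotonic alignment incurs no penalty from either term.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Attending to the same input position twice is penalised by both terms.
stuck = [[1.0, 0.0], [1.0, 0.0]]
```

Both penalties would be added to the usual cross-entropy loss with small weights. The intuition is that post-OCR correction is mostly a character-by-character copy, so attention should move roughly left to right and should not revisit positions it has already consumed.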

Through this combination of a benchmark dataset, synthetic data generation, and a tailored model architecture, the researchers reduced the errors introduced during OCR and improved the quality of the documents by 25%, which is 16% more than the previous state of the art on the ICDAR 2019 Bulgarian dataset.
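A percentage such as the 25% figure is easier to interpret with a worked example. The sketch below assumes, as is common for post-OCR correction, that quality is measured as the relative reduction in character-level edit distance to the ground truth. The summary does not spell out the exact metric, so treat this as an illustration; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def relative_improvement(ocr: str, corrected: str, truth: str) -> float:
    """Percentage of the original OCR edit distance removed by correction."""
    before = edit_distance(ocr, truth)
    if before == 0:
        return 0.0  # nothing to fix, so no improvement is possible
    after = edit_distance(corrected, truth)
    return 100.0 * (before - after) / before


truth = "градъ градъ градъ градъ"
ocr = "градь градь градь градь"        # four characters misread
corrected = "градъ градь градь градь"  # one of the four errors fixed
improvement = relative_improvement(ocr, corrected, truth)  # 25.0
```

Under this reading, a 25% improvement means that a quarter of the character errors present in the raw OCR output are gone after correction. The value becomes negative if the model introduces more errors than it removes.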

Critical Analysis

The researchers have made a valuable contribution by addressing the challenge of OCR text correction for historical documents, specifically focusing on the case of Bulgarian. By creating a benchmark dataset and developing a method for generating synthetic training data, they have laid the groundwork for further research and improvements in this area.

However, one potential limitation of the study is the reliance on contemporary literature as the source for generating synthetic data. While this approach appears effective, it may not fully capture the nuances and variations present in the historical documents themselves. Additionally, the reported 25% improvement in document quality (16% above the previous state of the art on the ICDAR 2019 Bulgarian dataset), while significant, still leaves room for further advancements.

Future research could explore alternative data generation techniques, such as incorporating more historical sources or leveraging linguistic knowledge to better simulate the historical orthography. Additionally, investigating the effectiveness of the proposed methods on other historical languages and document types could broaden the impact and applicability of this work.

Conclusion

This research addresses a crucial challenge in the digitization of historical documents by improving the post-OCR text correction process for Bulgarian. By creating a benchmark dataset, developing a synthetic data generation method, and augmenting state-of-the-art language models with diagonal attention loss and copy and coverage mechanisms, the researchers improved document quality by 25%, outperforming the previous state of the art on the ICDAR 2019 Bulgarian dataset by 16%.

This work represents an important step forward in preserving the cultural heritage encoded in historical documents and making them more accessible for search, analysis, and further research. The insights and methodologies presented in this study could inspire similar efforts for other languages and document types, ultimately contributing to the broader goal of democratizing access to our collective historical knowledge.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Post-OCR Text Correction for Bulgarian Historical Documents

Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov

The digitization of historical documents is crucial for preserving the cultural heritage of the society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem as standard OCR tools are not tailored to deal with historical orthography as well as with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary literature Bulgarian texts. We then use state-of-the-art LLMs and encoder-decoder framework which we augment with diagonal attention loss and copy and coverage mechanisms to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.

9/4/2024

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

Jonathan Bourne

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.

9/2/2024

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Queenie Luo, Yung-Sung Chuang

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using OCR technology, but most manuscripts were blemished over the centuries so that an Optical Character Recognition (OCR) program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames -- a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens, visualized Attention and Self-Attention heatmaps in our model.

5/16/2024

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Blnd Yaseen, Hossein Hassani

Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Having a good Optical Character Recognition (OCR) system to help process these publications and contribute to Kurdish language resources is crucial, as Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents as they have many issues, including being damaged, very fragile, having many marks left on them, and often being written in non-standard fonts. This is a massive obstacle in processing these documents, as currently processing them requires manual typing, which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Currently, there is no public dataset, so we developed our own by collecting historical documents from the Zheen Center for Documentation and Research, which were printed before 1950, resulting in a dataset of 1233 images of lines with a transcription of each. Then we used the Arabic model as our base model and trained the model using the dataset. We used different methods to evaluate our model: Tesseract's built-in evaluator, lstmeval, indicated a Character Error Rate (CER) of 0.755%. Additionally, Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application to provide an easy-to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. Having an extensive dataset is crucial to develop OCR systems with reasonable accuracy; as no public datasets are currently available for historical Kurdish documents, this posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved another challenge in our work.

4/10/2024