Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Read original: arXiv:2304.03427 - Published 5/16/2024 by Queenie Luo, Yung-Sung Chuang

Overview

Scholars rely heavily on ancient manuscripts to study history, religion, and socio-political structures
Many efforts have been made to digitize these manuscripts using Optical Character Recognition (OCR) technology
However, most manuscripts have been damaged over the centuries, and faded text and stained pages prevent OCR from reading them accurately
This paper presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct noisy OCR output

Plain English Explanation

Over the centuries, ancient manuscripts have become an invaluable resource for scholars studying history, religion, and social structures of the past. In an effort to preserve and expand access to these precious documents, many organizations have turned to Optical Character Recognition (OCR) technology to digitize the manuscripts.

However, these manuscripts often suffer from age-related damage, such as faded text and stains on the pages. This makes it hard for OCR programs to read every character correctly. As a result, the digital versions of the manuscripts may contain misrecognized characters and spelling errors.

To address this issue, the researchers in this paper developed a neural spelling correction model that can automatically fix these errors in the OCR-processed Tibetan manuscripts. The model is built on the Google OCR-ed versions of the Tibetan manuscripts, and it uses a Transformer architecture with a Confidence Score mechanism to perform the spelling correction task.

Technical Explanation

The researchers first feature-engineered their raw Tibetan text corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. They then implemented a Confidence Score mechanism into the Transformer architecture to perform the spelling correction tasks.
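The idea of paired data can be illustrated with a small sketch. This summary does not say how the authors built their toy pairs, so everything below is an assumption: the noise model (random character substitutions), the `CONFUSABLE` character pool, the synthetic confidence values standing in for OCR scores, and the function names `corrupt` and `make_pairs` are all hypothetical.

```python
import random

# Hypothetical pool of replacement characters (a few Tibetan consonants).
CONFUSABLE = list("ཀཁགངཅཆཇཉཏཐདན")
TSHEG = "་"  # syllable delimiter, left untouched in this sketch


def corrupt(clean, error_rate, rng):
    """Return (noisy_text, confidences) using random character substitutions.

    The confidences are synthetic stand-ins for OCR scores (an assumption):
    substituted characters get a low value, untouched ones a high value.
    """
    noisy_chars, confidences = [], []
    for ch in clean:
        if ch != TSHEG and rng.random() < error_rate:
            # Pick a replacement that differs from the original character.
            replacement = rng.choice([c for c in CONFUSABLE if c != ch])
            noisy_chars.append(replacement)
            confidences.append(round(rng.uniform(0.1, 0.6), 2))
        else:
            noisy_chars.append(ch)
            confidences.append(round(rng.uniform(0.8, 1.0), 2))
    return "".join(noisy_chars), confidences


def make_pairs(lines, error_rate=0.1, seed=0):
    """Build (noisy source, confidence, clean target) records from clean lines."""
    rng = random.Random(seed)
    pairs = []
    for clean in lines:
        noisy, conf = corrupt(clean, error_rate, rng)
        pairs.append({"source": noisy, "confidence": conf, "target": clean})
    return pairs


if __name__ == "__main__":
    for pair in make_pairs(["བཀྲ་ཤིས་བདེ་ལེགས"], error_rate=0.2):
        print(pair["source"], "->", pair["target"])
```

Each record keeps the noisy input, one confidence value per character, and the clean target, which is the shape of data a sequence-to-sequence corrector with a confidence signal would need. Real OCR errors also include insertions and deletions, which this substitution-only sketch does not model.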

Measured by loss and Character Error Rate (CER), the Transformer with the Confidence Score mechanism outperformed the three baselines: a plain Transformer, LSTM-2-LSTM, and GRU-2-GRU. To examine the robustness of their model, the researchers analyzed the erroneous tokens and visualized the Attention and Self-Attention heatmaps.
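For readers unfamiliar with CER, it is commonly defined as the character-level edit distance between the model output and the reference, divided by the reference length. The sketch below follows that common definition. Whether the authors computed it exactly this way (for example, over Unicode code points, as Python strings do here) is an assumption, and the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(character_error_rate("abcd", "abxd"))  # one substitution in four characters: 0.25
```

A lower CER means the corrected text is closer to the reference. The value can exceed 1.0 when the hypothesis is much longer than the reference.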

Critical Analysis

The researchers acknowledge that their model is specifically designed for Tibetan manuscripts, and they suggest that further research is needed to test the generalizability of their approach to other languages and manuscript types. Additionally, the paper does not provide a detailed discussion of the limitations of the Confidence Score mechanism or the impact of the feature engineering process on the model's performance.

While the results are promising, a comparison against human-corrected versions of the OCR output would clarify the practical value of the model. The paper could also have explored incorporating additional contextual information, such as document structure or metadata, to further improve spelling correction.

Conclusion

This paper presents a novel approach to addressing the challenge of OCR errors in digitized ancient manuscripts, specifically focusing on Tibetan texts. By developing a Transformer-based neural spelling correction model with a Confidence Score mechanism, the researchers have demonstrated a promising solution for automatically correcting the noisy output of OCR systems.

The findings of this research have the potential to significantly improve the accuracy and accessibility of digitized manuscript collections, enabling scholars to more effectively study and understand the history, religion, and socio-political structures of the past. As the field of digital humanities continues to evolve, this work serves as an important contribution to the ongoing efforts to preserve and analyze these invaluable cultural heritage resources.


Related Papers

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Queenie Luo, Yung-Sung Chuang

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using OCR technology, but most manuscripts were blemished over the centuries so that an Optical Character Recognition (OCR) program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames -- a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens, visualized Attention and Self-Attention heatmaps in our model.

5/16/2024

Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction

Dingyao Yu, Yang An, Wei Ye, Xiongfeng Xiao, Shaoguang Mao, Tao Ge, Shikun Zhang

Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality corpora, due to the labor-intensive labeling of spelling errors in real-life human writing or typing scenarios. Two data augmentation methods are widely adopted: (1) Random Replacement with the guidance of confusion sets and (2) OCR/ASR-based Generation that simulates character misusing. However, both methods inevitably introduce noisy data (e.g., false spelling errors), potentially leading to over-correction. By carefully analyzing the two types of corpora, we find that though the latter achieves more robust generalization performance, the former yields better-calibrated CSC models. We then provide a theoretical analysis of this empirical observation, based on which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data samples are fed into a well-calibrated CSC model trained on random replacement-based corpora and then filtered based on prediction confidence. By learning a simple BERT-based model on the refined OCR/ASR-based corpus, we set up impressive state-of-the-art performance on three widely-used benchmarks, while significantly alleviating over-correction (e.g., lowering false positive predictions).

7/23/2024

Contextual Spelling Correction with Language Model for Low-resource Setting

Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, Subarna Shakya

The task of Spell Correction (SC) in low-resource languages presents a significant challenge due to the availability of only a limited corpus of data and no annotated spelling correction datasets. To tackle these challenges a small-scale word-based transformer LM is trained to provide the SC model with contextual understanding. Further, the probabilistic error rules are extracted from the corpus in an unsupervised way to model the tendency of error happening (error model). Then the combination of LM and error model is used to develop the SC model through the well-known noisy channel framework. The effectiveness of this approach is demonstrated through experiments on the Nepali language where there is access to just an unprocessed corpus of textual data.

4/30/2024

A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance

Amirreza Naziri, Hossein Zeinali

Writing, as an omnipresent form of human communication, permeates nearly every aspect of contemporary life. Consequently, inaccuracies or errors in written communication can lead to profound consequences, ranging from financial losses to potentially life-threatening situations. Spelling mistakes, among the most prevalent writing errors, are frequently encountered due to various factors. This research aims to identify and rectify diverse spelling errors in text using neural networks, specifically leveraging the Bidirectional Encoder Representations from Transformers (BERT) masked language model. To achieve this goal, we compiled a comprehensive dataset encompassing both non-real-word and real-word errors after categorizing different types of spelling mistakes. Subsequently, multiple pre-trained BERT models were employed. To ensure optimal performance in correcting misspelling errors, we propose a combined approach utilizing the BERT masked language model and Levenshtein distance. The results from our evaluation data demonstrate that the system presented herein exhibits remarkable capabilities in identifying and rectifying spelling mistakes, often surpassing existing systems tailored for the Persian language.

7/25/2024