Efficient OCR for Building a Diverse Digital History

Read original: arXiv:2304.02737 - Published 7/29/2024 by Jacob Carlson, Tom Bryan, Melissa Dell

Overview

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history.
Existing optical character recognition (OCR) models typically use a sequence-to-sequence architecture that jointly learns vision and language; this extends poorly to low-resource document collections because it requires extensive labeled sequences and compute.
This study proposes modeling OCR as a character-level image retrieval problem, using a contrastively trained vision encoder.
This approach is more sample-efficient and extensible than existing architectures, enabling accurate OCR in settings where current solutions fail.
Crucially, this model opens new avenues for community engagement in making digital history more representative.

Plain English Explanation

The researchers point out that while many people use digital archives every day, the information available in these archives does not fully represent the diversity of historical documents. The typical approach to optical character recognition (OCR), which involves jointly learning a vision and language model, has difficulty handling document collections with limited resources, as it requires a large amount of labeled data and computing power to train.

To address this, the researchers propose modeling OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. This means the model learns to recognize the visual features of individual characters, rather than trying to jointly learn both the visual and language aspects. Because the model only learns what characters look like, it needs less labeled data and can be extended to a wider range of document collections, including those with limited resources.

Importantly, this new approach to OCR opens up new ways for communities to get involved in making digital historical records more representative of the full diversity of documentary history. By making OCR more accessible and adaptable, the researchers hope to enable more people to contribute to building digital archives that better reflect the richness of the past.

Technical Explanation

The researchers frame OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. This differs from the typical sequence-to-sequence architecture, which jointly learns a vision and language model. The researchers argue that the sequence-to-sequence approach is poorly extensible to low-resource document collections, as it requires extensive labeled sequences and compute to train a language-vision model.
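
To make the retrieval framing concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names (`l2_normalize`, `recognize`), the use of cosine similarity, and the toy three-dimensional embeddings are all assumptions made for illustration. In a real system the embeddings would come from the trained vision encoder applied to cropped character images.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def recognize(crop_embeddings, index_embeddings, index_labels):
    """Return the label of the nearest indexed exemplar for each character crop."""
    queries = l2_normalize(np.asarray(crop_embeddings, dtype=float))
    index = l2_normalize(np.asarray(index_embeddings, dtype=float))
    similarity = queries @ index.T       # shape: (num_crops, num_exemplars)
    nearest = similarity.argmax(axis=1)  # best-matching exemplar per crop
    return "".join(index_labels[i] for i in nearest)

# Toy stand-ins for encoder outputs (hypothetical values, not from the paper).
index_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 1.0]])
index_labels = ["a", "b", "c"]

# Embeddings of three character crops, in reading order.
crop_embeddings = np.array([[0.1, 0.9, 0.0],
                            [0.8, 0.1, 0.1],
                            [0.0, 0.2, 0.7]])

text = recognize(crop_embeddings, index_embeddings, index_labels)
```

Under this kind of setup, supporting an additional character could in principle amount to appending one more exemplar embedding and label to the index, which gives one intuition for why a retrieval formulation may be easier to extend than a sequence-to-sequence model.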

In contrast, the researchers' character-level retrieval model only needs to learn the visual features of characters, making it more sample-efficient and extensible. The model is trained using a contrastive loss function, which encourages the vision encoder to learn discriminative representations of individual characters.
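
The summary does not say which contrastive objective is used, so the following is only a sketch of one common choice, a supervised contrastive loss, written in NumPy. The function name `supervised_contrastive_loss`, the `temperature` value, and the toy embeddings are assumptions for illustration. The idea it shows is the general one described above: embeddings of crops showing the same character are pulled together, and embeddings of different characters are pushed apart.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average, over anchors that have a positive, of the negative mean
    log-probability assigned to same-label samples."""
    z = np.asarray(embeddings, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-length rows
    labels = np.asarray(labels)
    n = len(labels)

    not_self = ~np.eye(n, dtype=bool)
    # Pairwise similarities; an anchor is never compared with itself.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)

    # Numerically stable log-softmax over the other samples in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Positives: other samples that share the anchor's character label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_counts = positives.sum(axis=1)
    has_pos = pos_counts > 0

    sum_log_prob_pos = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -sum_log_prob_pos[has_pos] / pos_counts[has_pos]
    return per_anchor.mean()

labels = ["a", "a", "b", "b"]
# Same-label embeddings close together (what training aims for).
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same-label embeddings far apart (what training penalizes).
scattered = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_clustered = supervised_contrastive_loss(clustered, labels)
loss_scattered = supervised_contrastive_loss(scattered, labels)
```

The clustered batch yields the lower loss of the two, which is the behavior a contrastive objective rewards. The sketch assumes a batch of at least two samples, nonzero embeddings, and at least one label that occurs more than once.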

This approach enables accurate OCR in settings where existing solutions struggle, as the model does not need to learn a full language model. Crucially, the researchers highlight that this model opens new avenues for community engagement in expanding the diversity of digital historical records, as it is more accessible and adaptable than previous OCR techniques.

Critical Analysis

The researchers acknowledge that their approach has certain limitations. For example, they note that the character-level retrieval model may struggle with longer-range linguistic dependencies that are better captured by a full language model. Additionally, this summary does not detail how the model's performance compares to state-of-the-art OCR systems; readers should consult the original paper for those results.

One potential concern is that the researchers' emphasis on community engagement and expanding the diversity of digital historical records may lead to a trade-off in terms of accuracy or reliability. It will be important to carefully consider how to balance these priorities and ensure that the OCR system maintains a high standard of quality, even as it becomes more accessible and adaptable.

Furthermore, the researchers do not address potential issues related to data privacy, copyright, or ethical considerations in the process of expanding digital archives. These are important factors that should be carefully considered when developing technologies for community-driven digitization efforts.

Conclusion

This research proposes a novel approach to optical character recognition, modeling it as a character-level image retrieval problem using a contrastively trained vision encoder. This method is more sample-efficient and extensible than existing OCR architectures, enabling accurate recognition even in low-resource settings.

Crucially, the researchers highlight that this model opens new avenues for community engagement in making digital historical records more representative of the full diversity of documentary history. By making OCR more accessible and adaptable, this work has the potential to democratize the process of building digital archives and expand the range of voices and perspectives represented in these important resources.

While the research has some limitations, it represents a significant step forward in addressing the challenges of OCR in underserved document collections. As the researchers continue to develop and refine this approach, it will be important to carefully consider the ethical implications and ensure that the resulting digital archives maintain high standards of quality and reliability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient OCR for Building a Diverse Digital History

Jacob Carlson, Tom Bryan, Melissa Dell

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) - which jointly learns a vision and language model - is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters' visual features, it is more sample efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.

7/29/2024

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

Jonathan Bourne

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.

9/2/2024

Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

Ahmed Mustafa, Muhammad Tahir Rafique, Muhammad Ijlal Baig, Hasan Sajid, Muhammad Jawad Khan, Karam Dad Kallu

This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text, leveraging transformer-based architectures and attention mechanisms to address the distinct challenges of Urdu script recognition, including its diverse text styles, fonts, and variations. The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance by enabling context-aware inference and iterative refinement through the training of multiple token permutations. This method allows the model to adeptly manage character reordering and overlapping characters, commonly encountered in Urdu script. Trained on a dataset comprising approximately 160,000 Urdu text images, the model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178. Despite ongoing challenges in handling certain text variations, the model exhibits superior accuracy and effectiveness in practical applications. Future work will focus on refining the model through advanced data augmentation techniques and the integration of context-aware language models to further enhance its performance and robustness in Urdu text recognition.

9/2/2024

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as characters and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above characters under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.

9/4/2024