Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

Read original: arXiv:2408.15119 - Published 9/2/2024 by Ahmed Mustafa, Muhammad Tahir Rafique, Muhammad Ijlal Baig, Hasan Sajid, Muhammad Jawad Khan, Karam Dad Kallu

Overview

This research paper introduces a novel Optical Character Recognition (OCR) model for digital Urdu text recognition.
The model utilizes transformer-based architectures and attention mechanisms, trained on a large dataset of Urdu text images.
The model achieved a character error rate (CER) of 0.178 on Urdu text.

Plain English Explanation

The researchers developed a new Optical Character Recognition (OCR) model specifically designed to recognize Urdu text in digital images. Urdu is a language written in the Perso-Arabic script, which can be challenging for traditional OCR systems to accurately recognize.

To address this, the researchers used a specialized transformer-based architecture and attention mechanisms to build their model. They trained it on a large dataset of around 160,000 Urdu text images, which allowed the model to learn the nuances of the Urdu script.

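The researchers measured the model's performance using a metric called character error rate (CER): the number of character insertions, deletions, and substitutions needed to turn the model's output into the correct text, divided by the number of characters in the correct text. Lower is better. Their model achieved a CER of 0.178, which, read as a fraction, corresponds to roughly 18 character errors per 100 characters of reference text.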
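To make the metric concrete, here is a minimal sketch of how CER is commonly computed from edit distance. It is not the authors' evaluation code; the function names are hypothetical, and it assumes characters are counted as Unicode code points, so Urdu diacritics would be scored as separate characters.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted per code point."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total edits divided by total reference characters over a test set."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars
```

For a four-character reference word where the model drops one character, this gives 1 / 4 = 0.25. Whether the paper averages per word or pools all characters as above is not stated in this summary.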

A central component of the model is the "permuted autoregressive sequence" (PARSeq) approach, which lets the model use the context both before and after each character when making its recognition decisions. Having context on both sides can help with characters that are hard to identify in isolation, leading to more accurate results.

Technical Explanation

The researchers developed a novel word-level OCR model for digital Urdu text recognition. They utilized transformer-based architectures and attention mechanisms to build their model, which was trained on a comprehensive dataset of approximately 160,000 Urdu text images.

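The model's architecture is built on the permuted autoregressive sequence (PARSeq) model. PARSeq is trained on multiple permutations of the token order rather than only left to right, which enables context-aware inference and iterative refinement. Because the decoder learns to predict each character from context on either side, it can draw on bidirectional information, and the researchers credit this with improving recognition accuracy and with helping the model manage the character reordering and overlapping characters common in Urdu script.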
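The permutation idea can be illustrated with attention masks. The sketch below is a simplified illustration, not the authors' implementation: `permutation_mask` and `render` are hypothetical helper names, and details of a real PARSeq decoder such as begin and end tokens are left out. Given a decoding order, it marks which positions each character is allowed to look at.

```python
from typing import List, Sequence


def permutation_mask(perm: Sequence[int]) -> List[List[bool]]:
    """Build an attention mask for one decoding order.

    perm lists sequence positions in the order they are predicted.
    mask[q][k] is True when position q may attend to position k, which
    is the case only if k comes before q in that order.
    """
    n = len(perm)
    if sorted(perm) != list(range(n)):
        raise ValueError("perm must be a permutation of 0..n-1")
    mask = [[False] * n for _ in range(n)]
    for step, query in enumerate(perm):
        for key in perm[:step]:
            mask[query][key] = True
    return mask


def render(mask: List[List[bool]]) -> str:
    """Show a mask as rows of '1' (visible) and '.' (hidden)."""
    return "\n".join("".join("1" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # Left-to-right order: each position sees only earlier positions.
    print(render(permutation_mask([0, 1, 2])))
    print()
    # Right-to-left order: each position sees only later positions.
    print(render(permutation_mask([2, 1, 0])))
```

Training one set of weights under several such masks is what lets a single decoder condition on left context, right context, or a mix of both, instead of committing to one reading direction.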

In their experiments, the researchers report a character error rate (CER) of 0.178 and describe the model as more accurate at recognizing Urdu characters than previous approaches. They also report that it handles a diverse range of Urdu text styles, fonts, and variations, which makes it more applicable in real-world scenarios.

Critical Analysis

While the researchers' model has shown promising results, the paper also mentions some limitations that could be addressed in future work. The model can struggle with blurred images, non-horizontal orientations, and overlays of patterns, lines, or other text, which can occasionally lead to suboptimal performance.

Additionally, the researchers note that trailing or following punctuation marks can introduce noise into the recognition process and reduce the model's accuracy. Addressing these challenges will be a focus of future research: the researchers aim to refine the model further, explore data augmentation techniques, optimize hyperparameters, and integrate contextual improvements for more accurate and efficient Urdu text recognition.
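One way to gauge how much of the error comes from punctuation is to score predictions a second time with edge punctuation removed. The sketch below is an assumption for illustration only: the paper, as summarized here, does not describe this step, and `strip_edge_punctuation` is a hypothetical helper. It relies on Unicode general categories, under which Urdu marks such as the full stop (U+06D4), comma (U+060C), and question mark (U+061F) are classed as punctuation.

```python
import unicodedata


def strip_edge_punctuation(word: str) -> str:
    """Remove punctuation from the start and end of a word, keeping the middle."""
    start, end = 0, len(word)
    # Unicode general categories beginning with "P" are punctuation.
    while start < end and unicodedata.category(word[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(word[end - 1]).startswith("P"):
        end -= 1
    return word[start:end]
```

Comparing the error rate with and without this normalization would show how much of the reported error is attributable to punctuation rather than to the letters themselves.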

Conclusion

This research paper presents a significant advancement in Urdu text recognition by introducing a novel OCR model that leverages transformer-based architectures and attention mechanisms. The model's unique PARSeq approach and its ability to handle a wide range of Urdu text variations make it a valuable tool for various applications, such as digital archiving, content analysis, and language preservation.

The researchers' work highlights the importance of developing specialized OCR models for underrepresented languages, as traditional approaches may not be able to capture the nuances of scripts like Urdu. By addressing the model's current limitations, the researchers can further improve its performance and pave the way for more accurate and accessible digital Urdu text recognition.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
