Optical Text Recognition in Nepali and Bengali: A Transformer-based Approach

Read original: arXiv:2404.02375 - Published 4/4/2024 by S M Rakib Hasan, Aakar Dhakal, Md Humaion Kabir Mehedi, Annajiat Alim Rasel

Overview

This paper presents a Transformer-based approach to optical character recognition (OCR) for Nepali and Bengali, two low-resource languages.
The researchers developed a neural network model that can accurately extract text from images of Nepali and Bengali documents.
The model was trained on a new dataset of Nepali and Bengali text images, and its performance was evaluated on several benchmark tasks.

Plain English Explanation

The paper describes a new way to automatically read and extract text from images of documents written in Nepali and Bengali. These are two languages that are not as widely studied or supported by technology as more common languages like English.

The researchers created a deep learning model, based on a Transformer architecture, that can look at an image of Nepali or Bengali text and accurately identify the words and characters. This is a challenging task because these languages have complex writing systems with many different characters and symbols.

To train the model, the researchers built a new dataset of Nepali and Bengali text images, which they used to teach the model how to recognize the text. They then tested the model on several benchmark tasks to see how well it could extract text from different types of Nepali and Bengali documents.

The key advantage of this approach is that it can make it much easier to digitize and process Nepali and Bengali text, which is important for applications like document archiving, translation, and information retrieval in these languages. By developing more advanced OCR technology for low-resource languages, the researchers are helping to make these languages more accessible and usable in the digital world.

Technical Explanation

The paper introduces a Transformer-based model for optical character recognition (OCR) in Nepali and Bengali, two low-resource Indic languages. The model uses a multi-headed attention mechanism to learn contextual representations of characters and words, which allows it to accurately recognize text even in complex, degraded, or noisy document images.
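For reference, the multi-head attention mentioned above is usually defined as in the original Transformer design. This summary does not state the paper's head count or dimensions, so the number of heads h and the key dimension d_k are left symbolic here, and the projection matrices W are the generic learned parameters rather than anything specific to this model:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
\]
```

Each head attends over the sequence with its own projections, which is what lets the model combine several kinds of context when deciding on a character.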

To train the model, the researchers created a new dataset of Nepali and Bengali text images, collected from various online sources. The dataset contains a diverse range of document types, including scanned books, handwritten notes, and born-digital content. The researchers preprocessed the images and annotated the text content to create training, validation, and test sets.
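The train/validation/test partition described above could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function name split_dataset, the 80/10/10 ratio, the fixed seed, and the file names are all assumptions, since the summary does not give the actual split sizes.

```python
import random


def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle samples reproducibly and cut them into train/val/test lists.

    The fractions and seed are illustrative defaults, not values from the paper.
    """
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split deterministic
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical (image_path, transcription) pairs standing in for real data.
samples = [(f"img_{i:03d}.png", f"label_{i}") for i in range(100)]
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

Fixing the seed means the same images always land in the same partition, which keeps later evaluation numbers comparable across training runs.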

The Transformer-based OCR model consists of a CNN-based feature extraction backbone, followed by a Transformer encoder and a character recognition head. The model is trained end-to-end on the Nepali and Bengali text image dataset, learning to map input images to sequences of recognized characters.
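To make "sequences of recognized characters" concrete, here is a small sketch of the character-to-index mapping that a recognition head would typically predict over. The CharVocab class, its special tokens, and the choice of Unicode code points as units are assumptions for illustration; the summary does not describe the paper's actual tokenization.

```python
class CharVocab:
    """Maps characters (Unicode code points) to integer ids and back.

    Hypothetical helper: the special tokens and code-point granularity are
    illustrative choices, not details taken from the paper.
    """

    PAD, BOS, EOS, UNK = "<pad>", "<bos>", "<eos>", "<unk>"

    def __init__(self, texts):
        specials = [self.PAD, self.BOS, self.EOS, self.UNK]
        chars = sorted({ch for text in texts for ch in text})
        self.itos = specials + chars                      # id -> symbol
        self.stoi = {s: i for i, s in enumerate(self.itos)}  # symbol -> id

    def encode(self, text):
        """Wrap the text in <bos>/<eos>; unseen characters become <unk>."""
        unk = self.stoi[self.UNK]
        ids = [self.stoi.get(ch, unk) for ch in text]
        return [self.stoi[self.BOS]] + ids + [self.stoi[self.EOS]]

    def decode(self, ids):
        """Turn ids back into text, stopping at the first <eos>."""
        out = []
        for i in ids:
            tok = self.itos[i]
            if tok == self.EOS:
                break
            if tok in (self.PAD, self.BOS):
                continue
            out.append(tok)
        return "".join(out)


vocab = CharVocab(["नेपाली", "বাংলা"])
ids = vocab.encode("বাংলা")
print(ids)
print(vocab.decode(ids))
```

Working at the code-point level means vowel signs and other combining marks get their own ids, which is one common way to handle Devanagari and Bengali script, though grapheme-cluster units are an equally plausible design.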

Experiments show that the Transformer-based model is competitive with existing OCR approaches for Nepali and Bengali (the paper's abstract describes its results as corresponding with current approaches), achieving character recognition accuracy of over 90% on benchmark datasets. The researchers also demonstrate the model's ability to generalize to diverse document layouts and degradation levels.
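Character-level OCR quality is commonly measured with the character error rate (CER): the edit distance between the predicted and reference text, divided by the reference length. The sketch below shows that computation under two assumptions: characters are counted as Unicode code points, and the function names are illustrative. The summary does not say exactly which metric the paper reports, so treat this as a general illustration rather than the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))  # 0.25
print(cer("বাংলা", "বাংল"))
```

Character accuracy is often quoted as 1 minus CER. Because CER can exceed 1 when the prediction contains many insertions, the two are not always interchangeable.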

Critical Analysis

The paper presents a promising approach for advancing optical text recognition in low-resource Indic languages like Nepali and Bengali. The Transformer-based architecture allows the model to effectively capture the complex contextual relationships between characters and words, which is crucial for accurate OCR in these languages.

One limitation mentioned in the paper is the relatively small size of the training dataset, which may constrain the model's ability to generalize to all possible variations of Nepali and Bengali text. The researchers suggest that expanding the dataset, potentially by incorporating more diverse document sources, could further improve the model's performance.

Additionally, the paper does not provide a detailed analysis of the model's failure cases or error patterns. Understanding the types of errors the model makes, and the underlying linguistic or visual factors that contribute to those errors, could inform future improvements to the architecture or training process.
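One simple way to surface such error patterns is to align each prediction with its reference and count which characters get confused, dropped, or inserted. The sketch below uses Python's difflib.SequenceMatcher for the alignment; the function name confusion_counts and the sample strings are hypothetical, and SequenceMatcher's alignment is a heuristic rather than a guaranteed minimal edit script.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(pairs):
    """Count (reference, predicted) mismatches across (ref, hyp) string pairs.

    Deletions appear as (chars, "") and insertions as ("", chars).
    """
    counts = Counter()
    for ref, hyp in pairs:
        matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                # Equal-length spans: pair characters up one to one.
                for r, h in zip(ref[i1:i2], hyp[j1:j2]):
                    counts[(r, h)] += 1
            elif tag == "replace":
                # Unequal spans: record the whole span as one confusion.
                counts[(ref[i1:i2], hyp[j1:j2])] += 1
            elif tag == "delete":
                counts[(ref[i1:i2], "")] += 1
            elif tag == "insert":
                counts[("", hyp[j1:j2])] += 1
    return counts


# Hypothetical (reference, prediction) pairs for illustration only.
pairs = [("cat", "cot"), ("bat", "bot"), ("cart", "cat")]
for (ref_chars, hyp_chars), n in confusion_counts(pairs).most_common():
    print(repr(ref_chars), "->", repr(hyp_chars), n)
```

Sorting the counts by frequency highlights the most common confusions first, for example visually similar characters or dropped vowel signs, which is the kind of evidence a failure-case analysis would build on.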

It would also be valuable to see the model evaluated on a broader range of downstream tasks, such as document understanding or information extraction, to assess its practical utility in real-world applications. Integrating the OCR model with other language processing components could unlock new possibilities for working with Nepali and Bengali digital content.

Conclusion

This paper presents a novel Transformer-based approach for optical text recognition in Nepali and Bengali, two low-resource Indic languages. The researchers developed an OCR model that can accurately extract text from diverse document images, with results that the abstract describes as corresponding with current approaches.

By advancing the state of the art in Indic language OCR, this work has the potential to significantly improve the accessibility and usability of Nepali and Bengali digital content. As more text is digitized and made searchable, it can enable better preservation, translation, and analysis of these historically important languages.

The Transformer-based architecture and the new dataset introduced in this paper also lay the groundwork for future research in low-resource language OCR, which could benefit many other underserved linguistic communities around the world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optical Text Recognition in Nepali and Bengali: A Transformer-based Approach

S M Rakib Hasan, Aakar Dhakal, Md Humaion Kabir Mehedi, Annajiat Alim Rasel

Efforts on the research and development of OCR systems for Low-Resource Languages are relatively new. Low-resource languages have little training data available for training Machine Translation systems or other systems. Even though a vast amount of text has been digitized and made available on the internet the text is still in PDF and Image format, which are not instantly accessible. This paper discusses text recognition for two scripts: Bengali and Nepali; there are about 300 and 40 million Bengali and Nepali speakers respectively. In this study, using encoder-decoder transformers, a model was developed, and its efficacy was assessed using a collection of optical text images, both handwritten and printed. The results signify that the suggested technique corresponds with current approaches and achieves high precision in recognizing text in Bengali and Nepali. This study can pave the way for the advanced and accessible study of linguistics in South East Asia.

4/4/2024

DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer

Da Chang, Yu Li

With the continuous development of Optical Character Recognition (OCR) and the expansion of application fields, text recognition in complex scenes has become a key challenge. Factors such as multiple fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models. Although OCR models based on deep learning have performed well in specific fields or similar datasets in recent years, the generalization ability and robustness of the model are still a big challenge when facing complex environments with multiple scenes. Furthermore, training an OCR model from scratch or fine-tuning all parameters is very demanding on computing resources and inference time, which limits the flexibility of its application. This study focuses on a fundamental aspect of mixed text recognition in response to the challenges mentioned above, which involves effectively fine-tuning the pre-trained basic OCR model to demonstrate exceptional performance across various downstream tasks. To this end, we propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR. This method embeds DoRA into the image encoder and LoRA into the internal structure of the text decoder, enabling efficient parameter fine-tuning for downstream tasks. Experiments show that compared to similar parameter adjustment methods, our model DLoRA-TrOCR has the smallest number of parameters and performs better. It can achieve state-of-the-art performance on complex scene datasets involving simultaneous recognition of mixed handwritten, printed and street view texts.

4/24/2024

Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

Ahmed Mustafa, Muhammad Tahir Rafique, Muhammad Ijlal Baig, Hasan Sajid, Muhammad Jawad Khan, Karam Dad Kallu

This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text, leveraging transformer-based architectures and attention mechanisms to address the distinct challenges of Urdu script recognition, including its diverse text styles, fonts, and variations. The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance by enabling context-aware inference and iterative refinement through the training of multiple token permutations. This method allows the model to adeptly manage character reordering and overlapping characters, commonly encountered in Urdu script. Trained on a dataset comprising approximately 160,000 Urdu text images, the model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178. Despite ongoing challenges in handling certain text variations, the model exhibits superior accuracy and effectiveness in practical applications. Future work will focus on refining the model through advanced data augmentation techniques and the integration of context-aware language models to further enhance its performance and robustness in Urdu text recognition.

9/2/2024

Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)

Kabita Parajuli, Shashidhar Ram Joshi

Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain. This work develops a novel encoder-decoder paradigm for Nepali video captioning to tackle this difficulty. LSTM and GRU sequence-to-sequence models are used in the model to produce related textual descriptions based on features retrieved from video frames using CNNs. A Nepali video captioning dataset is generated from the Microsoft Research Video Description Corpus (MSVD) dataset using Google Translate and manual post-editing. The efficiency of the model for Devanagari-scripted video captioning is demonstrated by BLEU, METEOR, and ROUGE measures, which are used to assess its performance.

5/21/2024