Image Based Character Recognition, Documentation System To Decode Inscription From Temple

Read original: arXiv:2405.17449 - Published 5/29/2024 by Velmathi G, Shangavelan M, Harish D, Krithikshun M S

Overview

• This research project focused on applying optical character recognition (OCR) methods to decipher ancient Tamil inscriptions discovered on the walls of the Brihadeeswarar Temple, which date back to the 10th century.

• The researchers experimented with Tesseract, a widely used OCR engine, applying modern ICR (intelligent character recognition) techniques to preprocess the raw data and using box-editing software to fine-tune their model.

• The goal was to evaluate the effectiveness of these OCR methods in accurately deciphering the nuances of the ancient Tamil characters, which pose unique challenges due to their historical context.
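
Box editors for Tesseract typically work on plain-text box files, where each row holds a symbol followed by its bounding box. The sketch below assumes the standard Tesseract layout (`symbol left bottom right top page`); the paper does not say which editor or file layout was used, and the names `Box`, `parse_box_line`, `format_box_line`, and `relabel` are hypothetical helpers for illustration only.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One row of a box file: a symbol and its bounding box on a page."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one row, assuming 'symbol left bottom right top page'."""
    # Split from the right: the last five fields are always integers,
    # so whatever precedes them is the symbol (possibly several code points).
    parts = line.strip().rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"malformed box line: {line!r}")
    symbol, *coords = parts
    left, bottom, right, top, page = map(int, coords)
    return Box(symbol, left, bottom, right, top, page)


def format_box_line(box: Box) -> str:
    """Serialize a Box back into the one-line text form."""
    return " ".join(str(field) for field in box)


def relabel(box: Box, symbol: str) -> Box:
    """Return a copy of the box with a corrected symbol (a manual fix)."""
    return box._replace(symbol=symbol)
```

For example, `relabel(parse_box_line("கி 10 20 40 60 0"), "க")` keeps the coordinates and changes only the label, which is the kind of correction a box editor lets a person make by hand.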

Plain English Explanation

• The researchers in this study wanted to use modern computer vision and text recognition techniques to read and interpret ancient Tamil inscriptions found on the walls of a 10th-century temple.

• The inscriptions are written in an old form of the Tamil script, which is harder for software to read than modern text. The researchers tried a popular OCR engine called Tesseract, together with some preprocessing steps and manual editing, to see how accurately it could decipher the ancient Tamil characters.

• The key goal was to evaluate how well these OCR techniques could handle the unique challenges posed by the historical and cultural context of these ancient inscriptions, with the hope of improving the preservation and interpretation of such valuable historical artifacts.

Technical Explanation

• The researchers used the Tesseract OCR engine as the core of their approach, with modern ICR techniques applied beforehand to preprocess the raw image data of the ancient Tamil inscriptions.

• They also used box-editing software to fine-tune their Tesseract-based model, which let them make manual adjustments and corrections to improve its performance on this specialized dataset.

• The team divided the available dataset of ancient inscriptions into training and testing sets and measured the model's performance by its accuracy rate, so the model could be evaluated on data it had not seen during training. This allowed them to assess how well the approach deciphers the nuances of the ancient Tamil script.
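
The evaluation setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 80/20 split, the fixed seed, and the definition of accuracy as one minus the character error rate are all assumptions, and `split_dataset`, `edit_distance`, and `character_accuracy` are hypothetical names. Characters are counted as Unicode code points, which is a simplification for Tamil, where one written character can span several code points.

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(predictions, references):
    """Assumed metric: 1 - (total edit distance / total reference length)."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    # Clamp at zero: very noisy output can have more errors than characters.
    return max(0.0, 1.0 - errors / total_chars)
```

In use, the recognized text for each held-out inscription image would be passed as `predictions` and the hand-corrected transcription as `references`; only the test portion returned by `split_dataset` should be scored.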

Critical Analysis

• The paper acknowledges the unique challenges posed by the historical context of the ancient Tamil inscriptions, such as variations in character shapes, layout, and material condition over time.

• While the researchers demonstrate promising results using their Tesseract-based OCR approach, they also note that further refinement and adaptation may be necessary to achieve even higher accuracy rates, particularly for less common or more degraded characters.

• Additional research could explore the use of more advanced neural network-based OCR techniques, as well as the incorporation of domain-specific linguistic knowledge, to better handle the complexities of the ancient Tamil script.

Conclusion

• This research project demonstrates the potential of applying modern OCR techniques to decipher and preserve ancient historical inscriptions, in this case the 10th-century Tamil inscriptions found on the walls of the Brihadeeswarar Temple.

• By experimenting with the Tesseract OCR engine and leveraging preprocessing and manual editing, the researchers were able to make progress in accurately reading these challenging ancient texts, which could have valuable implications for the study and preservation of such historical artifacts.

• However, the unique challenges posed by the historical and cultural context of these inscriptions suggest that further advancements in OCR technology and domain-specific knowledge may be necessary to fully unlock the wealth of information contained in these and similar ancient records.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Image Based Character Recognition, Documentation System To Decode Inscription From Temple

Velmathi G, Shangavelan M, Harish D, Krithikshun M S

This project undertakes the training and analysis of optical character recognition (OCR) methods applied to 10th-century ancient Tamil inscriptions discovered on the walls of the Brihadeeswarar Temple. The chosen OCR methods include Tesseract, a widely used OCR engine, using modern ICR techniques to preprocess the raw data and box-editing software to fine-tune our model. The analysis with Tesseract aims to evaluate its effectiveness in accurately deciphering the nuances of the ancient Tamil characters. The performance of our model on the dataset is determined by its accuracy rate, where the evaluated dataset is divided into a training set and a testing set. By addressing the unique challenges posed by the script's historical context, this study seeks to contribute valuable insights to the broader field of OCR, facilitating improved preservation and interpretation of ancient inscriptions.

5/29/2024

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Ameer Majeed, Hossein Hassani

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively, and a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as good as the default Syriac model of Tesseract.

8/27/2024

Efficient OCR for Building a Diverse Digital History

Jacob Carlson, Tom Bryan, Melissa Dell

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) - which jointly learns a vision and language model - is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters' visual features, it is more sample efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.

7/29/2024

Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

Ahmed Mustafa, Muhammad Tahir Rafique, Muhammad Ijlal Baig, Hasan Sajid, Muhammad Jawad Khan, Karam Dad Kallu

This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text, leveraging transformer-based architectures and attention mechanisms to address the distinct challenges of Urdu script recognition, including its diverse text styles, fonts, and variations. The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance by enabling context-aware inference and iterative refinement through the training of multiple token permutations. This method allows the model to adeptly manage character reordering and overlapping characters, commonly encountered in Urdu script. Trained on a dataset comprising approximately 160,000 Urdu text images, the model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178. Despite ongoing challenges in handling certain text variations, the model exhibits superior accuracy and effectiveness in practical applications. Future work will focus on refining the model through advanced data augmentation techniques and the integration of context-aware language models to further enhance its performance and robustness in Urdu text recognition.

9/2/2024