CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

Read original: arXiv:2406.04493 - Published 6/10/2024 by Abdelrahman Abdallah, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt

Overview

This paper introduces CORU, a dataset for post-OCR parsing and receipt understanding that targets multilingual receipts in Arabic and English.
The dataset collects real-world receipts from retail settings such as supermarkets and clothing stores, annotated with elements such as merchant name, total price, date, receipt number, and individual items.
The paper also reports baseline results for three tasks: object detection, OCR, and information extraction.

Plain English Explanation

The CORU dataset is a collection of real-world receipts, in Arabic and English, that have been annotated with information about different parts of the receipt, such as the merchant name, the total amount, and the individual items purchased. It is designed to help researchers and developers improve the accuracy of systems that automatically extract and understand the information on receipts.

Receipts can be challenging for computers to read and understand because they pack a lot of information into a small space, with handwritten notes, logos, and other elements that make it hard to identify the key details. The CORU dataset provides a large and diverse set of examples to help train and evaluate algorithms that parse and understand receipt data.

In addition to the receipt data, the paper reports baseline results that can serve as a starting point for receipt understanding tasks. The baselines range from traditional tools such as Tesseract OCR to more advanced neural network-based approaches, and they show how well current methods cope with reading receipt text and identifying the different elements on a receipt.

Overall, the CORU dataset and the baselines provided in this paper are valuable resources for researchers and developers working on automated processing of receipts and other types of documents. With a large, well-annotated dataset and reference baselines to compare against, the field can make progress more quickly in this area of document understanding and extraction.

Technical Explanation

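CORU is a dataset for post-OCR (Optical Character Recognition) parsing and receipt understanding, built for multilingual receipts in Arabic and English. It contains over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, along with 30,000 annotated images for line-level OCR and 10,000 items annotated for detailed information extraction. The annotations cover key elements such as merchant names, item descriptions, total prices, receipt numbers, and dates.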
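To make the idea of an annotated receipt concrete, the sketch below models one record as a small Python data structure. The class names, field names, and example values are all hypothetical and chosen for illustration; they are not the dataset's actual schema, which should be checked in the project repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LineItem:
    """One purchased item. Field names are hypothetical."""
    description: str
    price: Optional[float] = None


@dataclass
class ReceiptAnnotation:
    """Illustrative record for one annotated receipt (not the real schema)."""
    merchant_name: str
    total_amount: float
    date: Optional[str] = None
    receipt_number: Optional[str] = None
    items: List[LineItem] = field(default_factory=list)

    def items_total(self) -> float:
        """Sum the item prices that are present, as a check against total_amount."""
        return sum(item.price for item in self.items if item.price is not None)


# Made-up example values, not taken from the dataset.
example = ReceiptAnnotation(
    merchant_name="Example Market",
    total_amount=7.5,
    date="2024-06-10",
    items=[LineItem("Bread", 2.5), LineItem("Milk", 5.0)],
)
```

Comparing `example.items_total()` with `example.total_amount` is one simple consistency check that a structure like this makes possible.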

The paper reports baseline results for the three tasks the annotations are designed to support:

Object Detection: Models that locate the relevant regions of a receipt image, such as individual text lines, despite logos and other non-textual clutter.
OCR: Models that recognize the text in each detected line, from traditional tools such as Tesseract OCR to neural network-based approaches.
Information Extraction: Models that identify and classify the entities on a receipt (e.g., merchant name, total price, date, receipt number, individual items).
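As a rough illustration of what extracting entities from already-recognized receipt text involves, here is a minimal rule-based sketch. It is not one of the paper's baselines: the function name, the regular expressions, and the heuristic that the first line is the merchant name are all assumptions made for this example, and it only handles simple English-style lines.

```python
import re
from typing import Dict, List, Optional

# Hypothetical patterns: a trailing amount like "7.50" and a date like "10/06/2024".
AMOUNT = re.compile(r"(\d+(?:[.,]\d{2})?)\s*$")
DATE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")


def parse_receipt_lines(lines: List[str]) -> Dict[str, Optional[str]]:
    """Pull merchant, date, and total out of OCR'd text lines with simple rules."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    result: Dict[str, Optional[str]] = {"merchant": None, "date": None, "total": None}
    if cleaned:
        # Heuristic assumption: the first non-empty line is the merchant name.
        result["merchant"] = cleaned[0]
    for ln in cleaned:
        if result["date"] is None:
            date_match = DATE.search(ln)
            if date_match:
                result["date"] = date_match.group(1)
        # Only lines that begin with "total" are treated as the receipt total.
        if ln.lower().startswith("total"):
            amount_match = AMOUNT.search(ln)
            if amount_match:
                result["total"] = amount_match.group(1).replace(",", ".")
    return result


# Made-up OCR output for illustration.
ocr_lines = ["Example Market", "Date: 10/06/2024", "Bread   2.50", "Milk    5.00", "TOTAL   7.50"]
parsed = parse_receipt_lines(ocr_lines)
```

Hand-written rules like these break quickly on noisy layouts, OCR errors, and mixed-script text, which is the gap that learned models trained on annotated receipts are meant to close.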

The authors evaluate these baselines on CORU, comparing traditional methods with more advanced neural network-based approaches, to establish reference performance on the complex and noisy layouts typical of real-world receipts.
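This summary does not list the specific metrics used. One metric commonly reported for OCR output is the character error rate (CER): the edit distance between the recognized text and the reference text, divided by the reference length. The sketch below is a plain-Python illustration of that general definition, not the paper's evaluation code, and the function names are chosen for this example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, reading the reference "TOTAL 7.50" as "T0TAL 7.5O" involves two substituted characters out of ten, so `character_error_rate` gives 0.2 for that pair. Lower is better, and a value of 0 means the recognized text matches the reference exactly.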

Critical Analysis

The CORU dataset and the baselines presented in this paper are a significant contribution to the field of document understanding and extraction. By providing a large, well-annotated dataset of real-world receipts in Arabic and English, the authors address a key limitation in this area, where previous datasets were often small, synthetic, or lacking the diversity of real-world examples.

However, the dataset and the baselines have limitations. The receipts cover Arabic and English only, so extending the dataset to other languages would be an important next step. The baseline models also leave room for improvement, particularly on more complex or ambiguous receipt structures.

Furthermore, the paper does not delve deeply into the potential societal implications of this work. While improved receipt understanding could have practical benefits for consumers and businesses, there may also be privacy and security concerns that need to be carefully considered, especially as these technologies become more widespread.

Overall, the CORU dataset and the baselines presented in this paper are valuable contributions to the field of document understanding. By providing a robust dataset and solid starting points for further research, the authors have laid the groundwork for continued progress in this area.

Conclusion

The CORU dataset and the accompanying baselines introduced in this paper represent a significant advance in receipt understanding and post-OCR parsing. By providing a large, diverse, and well-annotated dataset of real-world receipts in Arabic and English, the authors have addressed a crucial gap in the resources available for this domain.

The baselines presented in the paper show how current methods perform on key receipt understanding tasks: object detection, OCR, and information extraction. They serve as a foundation for further research and development, and their reported performance gives later work a reference point to measure against.

While the dataset and the baselines have limitations, the overall contribution of this work is highly valuable. By making the datasets publicly available, the authors have opened the door for researchers and developers to build on this work and push the boundaries of what is possible in receipt understanding and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

Abdelrahman Abdallah, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt

In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (https://github.com/Update-For-Integrated-Business-AI/CORU).

6/10/2024

Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

Ahmed Mustafa, Muhammad Tahir Rafique, Muhammad Ijlal Baig, Hasan Sajid, Muhammad Jawad Khan, Karam Dad Kallu

This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text, leveraging transformer-based architectures and attention mechanisms to address the distinct challenges of Urdu script recognition, including its diverse text styles, fonts, and variations. The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance by enabling context-aware inference and iterative refinement through the training of multiple token permutations. This method allows the model to adeptly manage character reordering and overlapping characters, commonly encountered in Urdu script. Trained on a dataset comprising approximately 160,000 Urdu text images, the model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178. Despite ongoing challenges in handling certain text variations, the model exhibits superior accuracy and effectiveness in practical applications. Future work will focus on refining the model through advanced data augmentation techniques and the integration of context-aware language models to further enhance its performance and robustness in Urdu text recognition.

9/2/2024

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

Jonathan Bourne

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.

9/2/2024

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater

We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.

6/17/2024