MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification

Read original: arXiv:2405.18924 - Published 5/30/2024 by Miguel A. Ferrer, Abhijit Das, Moises Diaz, Aythami Morales, Cristina Carmona-Duarte, Umapada Pal

Overview

This paper introduces a new database for benchmarking script identification algorithms, which is crucial for applications involving handwriting and document analysis in multilingual environments.
The dataset includes printed and handwritten documents from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai.
The dataset consists of 1,135 documents, segmented into 13,979 lines and 86,655 words, so script identification performance can be evaluated at the document, line, and word levels.
The paper proposes benchmark tests using both handcrafted and deep learning methods, with results reported for printed and handwritten documents, as well as at the document, line, and word levels.

Plain English Explanation

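Script identification is the process of determining the writing system (script) in which a document or piece of text is written. It is related to, but distinct from, language identification: a single script such as Roman or Devanagari is used to write many languages. Script identification is an important task for applications that need to work with handwritten or printed materials from various linguistic backgrounds, such as document analysis systems or multilingual interfaces.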

The researchers in this paper have created a new dataset that can be used to test and compare different script identification algorithms. The dataset includes a wide variety of writing systems, including both printed and handwritten samples. This allows researchers to evaluate how well their algorithms can handle different types of text, from neatly printed documents to messy, handwritten notes.

The dataset is divided into separate documents, lines, and words, so algorithms can be tested at different levels of granularity. For example, an algorithm might be able to correctly identify the script of an entire document, but struggle with individual words or lines of text.
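To make the granularity point concrete, the sketch below derives a document-level label from word-level predictions by majority vote. This is an illustrative assumption, not a method described in the paper: the function name `document_script` and the sample predictions are hypothetical.

```python
from collections import Counter

def document_script(word_predictions):
    """Return the most frequent word-level label as the document-level label."""
    if not word_predictions:
        raise ValueError("need at least one word-level prediction")
    # most_common(1) returns a list holding the single (label, count) pair
    # with the highest count.
    return Counter(word_predictions).most_common(1)[0][0]

# Hypothetical word-level outputs for one Tamil document: two of seven are wrong.
words = ["Tamil", "Tamil", "Telugu", "Tamil", "Kannada", "Tamil", "Tamil"]

doc_label = document_script(words)
word_accuracy = sum(w == "Tamil" for w in words) / len(words)

print(doc_label)               # the document-level decision
print(f"{word_accuracy:.2f}")  # the word-level accuracy on the same document
```

In this made-up case the document is labelled correctly even though two of its seven words are misclassified, which is why document-level and word-level scores can diverge.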

The paper also provides benchmark results obtained with both traditional techniques built on handcrafted features and more modern deep learning approaches. This gives other researchers a starting point against which to compare their own script identification models.

Overall, this new dataset is expected to help drive progress in script identification research, by providing a common set of materials for testing and comparing different techniques. This could lead to improved handwriting and document analysis systems that can work seamlessly across multiple languages and writing styles.

Technical Explanation

The dataset introduced in this paper contains 1,135 documents, including both printed and handwritten text, covering a diverse range of scripts: Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The documents were scanned from local newspapers and from handwritten letters and notes produced by different native writers.

The dataset has been segmented into 13,979 lines and 86,655 words, allowing for script identification to be evaluated at the document, line, and word levels. This multi-level approach provides a more comprehensive assessment of algorithm performance compared to previous datasets.
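The three totals give a rough sense of scale per document. The short calculation below uses only the counts quoted above. It yields averages over the whole dataset, on the assumption that every line and word comes from one of the 1,135 documents; no per-script or printed/handwritten breakdown is given here, so it says nothing about how individual subsets differ.

```python
# Totals quoted in the summary.
N_DOCUMENTS = 1_135
N_LINES = 13_979
N_WORDS = 86_655

lines_per_document = N_LINES / N_DOCUMENTS
words_per_line = N_WORDS / N_LINES
words_per_document = N_WORDS / N_DOCUMENTS

print(f"{lines_per_document:.1f} lines per document")   # 12.3
print(f"{words_per_line:.1f} words per line")           # 6.2
print(f"{words_per_document:.1f} words per document")   # 76.3
```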

The paper establishes several benchmark tests, using both handcrafted feature-based techniques and deep learning methods. These benchmarks include:

Document-level script identification: Determining the script used in an entire document.
Line-level script identification: Identifying the script used in individual lines of text.
Word-level script identification: Classifying the script of individual words.

The benchmarks are run on both the printed and handwritten subsets of the dataset, allowing for the evaluation of algorithm robustness across different text types.
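One way to organise such an evaluation is to score predictions separately for each combination of level and text type. The following is a minimal sketch under stated assumptions: the record layout `(level, modality, true_script, predicted_script)`, the function name `accuracy_by_group`, and the sample records are all hypothetical and are not taken from the paper or its results.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Map (level, modality) to the fraction of correctly predicted records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, modality, true_script, predicted_script in records:
        key = (level, modality)
        total[key] += 1
        correct[key] += int(true_script == predicted_script)
    return {key: correct[key] / total[key] for key in total}

# Made-up predictions, purely to exercise the function.
records = [
    ("document", "printed", "Arabic", "Arabic"),
    ("document", "handwritten", "Thai", "Thai"),
    ("line", "printed", "Roman", "Roman"),
    ("line", "handwritten", "Bengali", "Devanagari"),
    ("word", "printed", "Tamil", "Tamil"),
    ("word", "printed", "Telugu", "Kannada"),
    ("word", "handwritten", "Oriya", "Oriya"),
    ("word", "handwritten", "Gujarati", "Gurmukhi"),
]

scores = accuracy_by_group(records)
for (level, modality), acc in sorted(scores.items()):
    print(f"{level:<9} {modality:<12} {acc:.2f}")
```

Keeping the six groups separate makes it visible when a method that does well on printed documents degrades on handwritten words, which a single pooled accuracy would hide.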

The reported results provide a baseline for future research, highlighting the performance of existing methods and the challenges posed by this new, diverse dataset. This is expected to spur the development of more advanced script identification algorithms, capable of handling a wide range of writing systems, both in printed and handwritten form.

Critical Analysis

The dataset and benchmarks presented in this paper represent a significant contribution to the field of script identification research. By including a wide variety of writing systems, both printed and handwritten, the dataset provides a more comprehensive and realistic testbed compared to previous resources.

However, the paper does not delve into the potential limitations or biases in the dataset. For example, it is unclear if the distribution of scripts in the dataset accurately reflects real-world usage patterns, or if certain scripts are overrepresented. Additionally, the quality and consistency of the handwritten samples may vary, which could impact the performance of machine learning-based approaches.
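A simple first check for over-representation is to count samples per script and compare the largest class with the smallest. The sketch below uses hypothetical labels, since the per-script counts of the dataset are not given in this summary; `imbalance_ratio` is an illustrative helper, not part of any released tooling.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical script labels for eleven samples.
labels = ["Roman"] * 6 + ["Arabic"] * 3 + ["Thai"] * 2

counts = Counter(labels)
shares = {script: n / len(labels) for script, n in counts.items()}
ratio = imbalance_ratio(labels)

for script, share in shares.items():
    print(f"{script:<7} {counts[script]:>2} samples ({share:.0%})")
print(f"imbalance ratio: {ratio:.1f}")
```

A ratio far above 1.0 would suggest reporting per-script accuracy alongside overall accuracy, so that strong results on a dominant script do not mask weak results on a rare one.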

Furthermore, the paper does not discuss the potential societal impacts of improved script identification technology. While this research has clear applications in document analysis and multilingual interfaces, it is important to consider how such technologies could be used in less benign contexts, such as surveillance or identification of individuals based on their handwriting.

Future research in this area should also explore the connection between script identification and human cognition, which the paper's abstract highlights. Understanding the cognitive processes underlying script recognition could lead to more robust and human-centric algorithms.

Conclusion

The new dataset and benchmarks introduced in this paper represent an important step forward in script identification research. By providing a diverse and comprehensive testbed, the study lays the foundation for the development of more advanced algorithms capable of handling a wide range of writing systems, both in printed and handwritten form.

The results reported in the paper serve as a baseline for future work, and the dataset is expected to spur further innovation in the field, leading to improved document analysis and multilingual applications. However, it is crucial that future research in this area also considers potential limitations, biases, and societal implications of the developed technologies.

Overall, this paper makes a valuable contribution to the field of script identification, and the resources it provides are likely to have a significant impact on the future of document analysis and related applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
