Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering

Read original: arXiv:2405.12533 - Published 5/22/2024 by Hiba Maryam, Ling Fu, Jiajun Song, Tajrian ABM Shafayet, Qidi Luo, Xiang Bai, Yuliang Liu

Overview

  • This research paper focuses on developing Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies.
  • The goal is to improve accessibility, information retrieval, and linguistic diversity in digital content, enabling better understanding and interaction with Urdu-language visual data.
  • The researchers propose a new multi-task Urdu scene text dataset with over 1000 natural scene images, annotated for text detection, recognition, and VQA tasks.
  • The dataset addresses limitations of previous datasets by providing fine-grained annotations for arbitrary-shaped texts, facilitating the development of methods that can handle diverse text layouts, intricate shapes, and non-standard orientations.
  • The VQA annotations make this the first benchmark for Urdu Text VQA, which can prompt the development of Urdu scene text understanding.

Plain English Explanation

The researchers are developing new technologies to better understand and interact with Urdu-language visual content, specifically text that appears in photographs of natural scenes. This is important for improving accessibility, information retrieval, and supporting linguistic diversity in the digital world.

To achieve this, the researchers have created a new dataset of over 1000 natural scene images that are annotated in detail. This means they have identified and labeled the text that appears in these images, including text with unusual shapes or orientations that can be challenging for existing systems to handle.

Additionally, the dataset includes "Visual Question Answering" (VQA) annotations. VQA is a technology that allows users to ask questions about the content of an image, and the system can then provide relevant answers. By including VQA annotations in this Urdu-language dataset, the researchers are creating the first benchmark for Urdu Text VQA, which can help drive the development of better Urdu scene text understanding capabilities.

Overall, this research aims to bridge the gap between textual and visual comprehension for Urdu-language content, making it more accessible and useful for a wider range of applications and users.

Technical Explanation

The researchers propose a new multi-task Urdu scene text dataset, which can be used for text detection, recognition, and Visual Question Answering (VQA) tasks. The dataset contains over 1000 natural scene images with fine-grained annotations for text instances, addressing the limitations of previous datasets that struggled with arbitrary-shaped texts.

By incorporating additional annotation points, the dataset facilitates the development and assessment of methods that can handle diverse text layouts, intricate shapes, and non-standard orientations commonly encountered in real-world scenarios. This is important for improving the performance of Urdu scene text understanding systems, which are crucial for advancing accessibility, information retrieval, and linguistic diversity in digital content.
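To make the benefit of additional annotation points concrete, here is a minimal Python sketch. It assumes a hypothetical annotation record (`TextInstance`, with a polygon and a transcription); the dataset's actual file format is not described in this summary, so the schema and all names below are illustrative only. The sketch compares the area of a multi-point polygon with the area of its axis-aligned bounding box, which shows how much background a plain box would include around curved or slanted text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextInstance:
    """Hypothetical annotation record (not the dataset's real schema)."""
    polygon: List[Point]   # ordered vertices around the text region
    transcription: str     # the text inside the region


def polygon_area(points: List[Point]) -> float:
    """Area of a simple (non-self-intersecting) polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box_area(points: List[Point]) -> float:
    """Area of the smallest axis-aligned rectangle containing all vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def tightness(instance: TextInstance) -> float:
    """Fraction of the bounding box actually covered by the polygon (1.0 = perfect fit)."""
    return polygon_area(instance.polygon) / bounding_box_area(instance.polygon)


# A bent strip of text outlined with six points (made-up coordinates).
curved = TextInstance(
    polygon=[(0, 0), (5, 2), (10, 0), (10, 3), (5, 5), (0, 3)],
    transcription="example",
)
print(tightness(curved))
```

For this made-up six-point outline the polygon covers 30 of the 50 square units in its bounding box, a ratio of 0.6, so a rectangular annotation would be 40% background. That is the kind of slack that finer-grained polygon annotations avoid.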

Furthermore, the VQA annotations make this dataset the first benchmark for Urdu Text VQA, which can prompt further work on Urdu scene text understanding. VQA lets users ask natural-language questions about an image and receive answers grounded in its content, so a benchmark of this kind measures whether a system can both read the Urdu text in a scene and reason about it.
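As an illustration of how answers on a text VQA benchmark might be scored, the following Python sketch computes exact-match accuracy. This summary does not state which metric the Urdu benchmark uses, so exact match is an assumption here, chosen because it is a common question-answering metric; the function names are hypothetical. Answers are Unicode-normalized first, since the same Urdu text can be encoded with either precomposed or combining characters.

```python
import unicodedata
from typing import List, Sequence


def normalize_answer(text: str) -> str:
    """Apply NFC normalization and collapse whitespace so equivalent strings compare equal."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> bool:
    """True if the prediction equals any accepted reference answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def exact_match_accuracy(
    predictions: Sequence[str], references: Sequence[Sequence[str]]
) -> float:
    """Share of questions answered exactly right; each question may have several references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


# Made-up example: the first answer matches after whitespace cleanup, the second does not.
preds = ["  کراچی ", "لاہور"]
refs = [["کراچی"], ["اسلام آباد"]]
print(exact_match_accuracy(preds, refs))
```

Exact match is strict: a single wrong character counts as a miss. Softer scores such as token-level F1 are often reported alongside it.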

The proposed dataset is available on GitHub (https://github.com/Hiba-MeiRuan/Urdu-VQA-Dataset-/tree/main), allowing researchers and developers to use it in their own work on Urdu scene text detection, recognition, and VQA.

Critical Analysis

The researchers have addressed several limitations of previous Urdu scene text datasets by providing fine-grained annotations for arbitrary-shaped texts. This is a notable improvement, as handling diverse text layouts, intricate shapes, and non-standard orientations is a key challenge in real-world scene text understanding.

However, the dataset is still relatively small, at just over 1000 images. While this is a good starting point, expanding it with a larger and more diverse set of images could further strengthen the benchmarking of Urdu scene text detection, recognition, and VQA methods.

Additionally, the paper does not provide a detailed analysis of the performance of existing state-of-the-art methods on this new dataset. Comparing the performance of leading approaches and identifying areas for improvement would help the research community better understand the current capabilities and limitations of Urdu scene text understanding technologies.

Future research could also explore the integration of this dataset with other Urdu-language resources, such as text corpora or knowledge bases, to further enhance the understanding and utilization of Urdu visual data in real-world applications.

Conclusion

This research paper presents a significant step forward in the development of Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies. By introducing a new multi-task Urdu scene text dataset with fine-grained annotations, the researchers have created a valuable resource for advancing the field and addressing the limitations of previous datasets.

The inclusion of VQA annotations makes this dataset the first benchmark for Urdu Text VQA, which can spur the development of Urdu scene text understanding capabilities. This is a crucial advancement, as it can improve accessibility, information retrieval, and linguistic diversity in digital content involving Urdu-language visual data.

Overall, this research contributes to bridging the gap between textual and visual comprehension for Urdu, ultimately enabling better understanding and interaction with Urdu-language visual content in a wide range of applications.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers
Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering

Hiba Maryam, Ling Fu, Jiajun Song, Tajrian ABM Shafayet, Qidi Luo, Xiang Bai, Yuliang Liu

The development of Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies is crucial for advancing accessibility, information retrieval, and linguistic diversity in digital content, facilitating better understanding and interaction with Urdu-language visual data. This initiative seeks to bridge the gap between textual and visual comprehension. We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images, which can be used for text detection, recognition, and VQA tasks. We provide fine-grained annotations for text instances, addressing the limitations of previous datasets for facing arbitrary-shaped texts. By incorporating additional annotation points, this dataset facilitates the development and assessment of methods that can handle diverse text layouts, intricate shapes, and non-standard orientations commonly encountered in real-world scenarios. Besides, the VQA annotations make it the first benchmark for the Urdu Text VQA method, which can prompt the development of Urdu scene text understanding. The proposed dataset is available at: https://github.com/Hiba-MeiRuan/Urdu-VQA-Dataset-/tree/main

5/22/2024

UQA: Corpus for Urdu Question Answering

Samee Arif, Sualeha Farid, Awais Athar, Agha Ali Raza

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we have an F1 score of 85.99 and 74.56 EM. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.

7/24/2024

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. Initially, research on this task focused on methods to help machines understand objects and scene contexts in images. However, text appearing in the image, which carries explicit information about the full content of the image, was not considered. Along with the continuous development of the AI era, there have been many studies worldwide on the reading comprehension ability of VQA models. In Vietnam, a developing country where conditions are still limited, this task is still open. Therefore, we introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images, which we call ViTextVQA (Vietnamese Text-based Visual Question Answering dataset). It contains over 16,000 images and over 50,000 questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at https://github.com/minhquan6203/ViTextVQA-Dataset for research purposes.

4/17/2024

ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

Huy Quang Pham, Thang Kien-Bao Nguyen, Quan Van Nguyen, Dan Quang Tran, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering text information contained in images that have just been significantly developed in the English language in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. In this dataset, all the images contain text and questions about the information relevant to the text in the images. We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset. Furthermore, we introduce a novel approach, called VisionReader, which achieved 0.4116 in EM and 0.6990 in the F1-score on the test set. Through the results, we found that the OCR system plays a very important role in VQA models on the ViOCRVQA dataset. In addition, the objects in the image also play a role in improving model performance. We open access to our dataset at link (https://github.com/qhnhynmm/ViOCRVQA.git) for further research in OCR-VQA task in Vietnamese.

4/30/2024