ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

Read original: arXiv:2209.08199 - Published 7/31/2024 by Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Victor Carbune, Jason Lin, Maria Wang, Srinivas Sunkara, Yun Zhu, Jindong Chen

Overview

A new benchmark and dataset called ScreenQA for understanding screen content through question answering
Aims to bridge the gap between screen datasets focused on structure and components and those focused on high-level tasks such as navigation and task completion
Annotates 86K question-answer pairs over the RICO dataset to assess screen reading comprehension
Annotates answers in different formats (full sentences, short forms) and identifies supporting UI content

Plain English Explanation

This paper introduces a new ScreenQA dataset and benchmark for evaluating a system's ability to understand the content and functionality of screen-based user interfaces through question answering.

Existing screen datasets focus either on low-level understanding of a screen's structure and components or on higher-level tasks such as navigating an interface or completing a goal. The ScreenQA dataset aims to bridge this gap by assessing "screen reading comprehension": the capacity to understand the meaning and purpose of the various elements on a screen.

The researchers annotated about 86,000 question-answer pairs over the RICO dataset of mobile app screenshots. The answers are provided in different formats for different application scenarios, including full sentences and short forms. The annotations also identify the specific user interface elements on the screen that support each answer, along with their bounding boxes.

With this rich dataset, the paper discusses how to evaluate systems on the ScreenQA benchmark, provides some baseline model performance, and outlines potential applications for the dataset, such as improving the understanding of screen-based interfaces for assistive technologies.

Technical Explanation

The ScreenQA dataset is built upon the existing RICO dataset, which contains over 66,000 screenshots of mobile app user interfaces. The researchers annotated this dataset with 86,000 question-answer pairs that assess different aspects of screen comprehension.

The questions cover a variety of scenarios, such as identifying UI components, understanding their functionalities, and reasoning about the overall purpose and workflow of the app. The answers are provided in two formats: full sentences that comprehensively explain the answer, and short-form responses that concisely capture the key information.

Crucially, the annotations also identify the specific UI elements on the screen that are relevant to answering each question. This allows for more fine-grained evaluation of a system's understanding, beyond just the final answer.
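The annotation described above can be pictured as a small record type. The following is a minimal sketch only: the class names, field names, and example values are hypothetical and chosen for illustration, and they do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# Hypothetical schema: names are illustrative, not the dataset's real format.
@dataclass
class UIElement:
    text: str                        # text shown by the UI element
    bbox: Tuple[int, int, int, int]  # assumed (left, top, right, bottom) in pixels


@dataclass
class ScreenQARecord:
    screen_id: str            # identifier of the screenshot (assumed)
    question: str
    full_answer: str          # full-sentence answer
    short_answers: List[str]  # one or more short-form answers
    supporting_elements: List[UIElement] = field(default_factory=list)

    def supporting_text(self) -> List[str]:
        """Return the text of every UI element that supports the answer."""
        return [element.text for element in self.supporting_elements]


# Invented example values, for illustration only.
record = ScreenQARecord(
    screen_id="example-0001",
    question="What is the departure time?",
    full_answer="The departure time is 9:30 AM.",
    short_answers=["9:30 AM"],
    supporting_elements=[UIElement(text="9:30 AM", bbox=(40, 210, 180, 250))],
)
print(record.supporting_text())
```

Keeping the supporting elements alongside the answers is what would let an evaluator check where on the screen a model found its answer, not only what it answered.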

The paper discusses several evaluation metrics for the ScreenQA benchmark, including exact match accuracy, F1 score, and a novel "UI coverage" metric that assesses how well a system can identify the relevant UI components. Baseline results are provided using both closed and open source models, demonstrating the challenges of the task.
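To make exact match and F1 concrete, here is a rough sketch of how such scores are commonly computed for short-form answers. It assumes the widely used SQuAD-style convention (lowercasing, stripping punctuation, token overlap); the paper's own definitions may differ, and the function names are illustrative.

```python
import string
from collections import Counter
from typing import Callable, List


def normalize(text: str) -> List[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_over_references(
    prediction: str,
    references: List[str],
    metric: Callable[[str, str], float],
) -> float:
    """Score against every acceptable answer and keep the best score."""
    return max(float(metric(prediction, ref)) for ref in references)


print(exact_match("9:30 AM", "9:30 am"))
print(token_f1("the red button", "red button"))
```

Taking the maximum over several reference answers is one simple way to handle questions with more than one acceptable short form.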

The ScreenQA dataset and benchmark are presented as a valuable resource for advancing the field of screen content understanding, with potential applications in areas like assistive technology, UI design, and human-computer interaction.

Critical Analysis

The ScreenQA dataset and benchmark represent a significant step forward in evaluating screen reading comprehension, bridging the gap between low-level structure understanding and high-level task completion.

One key strength of the dataset is the rich annotation, which not only provides answers in different formats but also identifies the relevant UI elements. This allows for more nuanced evaluation of a system's understanding, beyond just the final answer.

However, the paper acknowledges some limitations of the current dataset, such as its reliance on static screenshots rather than dynamic app interactions. In addition, the questions and answers are all in English, which limits the dataset's applicability to other languages and cultural contexts.

Further research could explore ways to extend the ScreenQA benchmark, such as incorporating more diverse app domains, supporting multilingual capabilities, or even considering temporal aspects of screen interaction. Investigating the generalization of screen reading comprehension models to new app interfaces would also be an important area for future work.

Overall, the ScreenQA dataset and benchmark represent a valuable contribution to the field of screen content understanding, providing a comprehensive assessment tool that can drive progress in areas like assistive technologies and user interface design.

Conclusion

The ScreenQA dataset and benchmark introduced in this paper offer a new way to evaluate a system's ability to understand the content and functionality of screen-based user interfaces. By annotating about 86,000 question-answer pairs over the RICO dataset of mobile app screenshots, the researchers have created a rich resource for assessing "screen reading comprehension": the capacity to interpret the meaning and purpose of various UI elements.

With the dataset's comprehensive annotations, including both full sentence and short-form answers as well as the identification of relevant UI components, the ScreenQA benchmark provides a more nuanced evaluation of a system's understanding beyond just the final answer. This represents a significant advancement over existing screen datasets, which have been focused either on low-level structural understanding or high-level task completion.

The potential applications of the ScreenQA dataset are wide-ranging, from improving assistive technologies to enhancing user interface design. By driving progress in screen content understanding, this work can contribute to creating more accessible and intuitive digital experiences for all users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Victor Carbune, Jason Lin, Maria Wang, Srinivas Sunkara, Yun Zhu, Jindong Chen

We present a new benchmark and dataset, ScreenQA, for screen content understanding via question answering. The existing screen datasets are focused either on structure and component-level understanding, or on a much higher-level composite task such as navigation and task completion. We attempt to bridge the gap between these two by annotating 86K question-answer pairs over the RICO dataset in hope to benchmark the screen reading comprehension capacity. This work is also the first to annotate answers for different application scenarios, including both full sentences and short forms, as well as supporting UI contents on screen and their bounding boxes. With the rich annotation, we discuss and define the evaluation metrics of the benchmark, show applications of the dataset, and provide a few baselines using closed and open source models.

7/31/2024

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

7/8/2024

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang

While Large Language Models (LLMs) have demonstrated commendable performance across a myriad of domains and tasks, existing LLMs still exhibit a palpable deficit in handling multimodal functionalities, especially for the Spoken Question Answering (SQA) task which necessitates precise alignment and deep interaction between speech and text features. To address the SQA challenge on LLMs, we initially curated the free-form and open-ended LibriSQA dataset from Librispeech, comprising Part I with natural conversational formats and Part II encompassing multiple-choice questions followed by answers and analytical segments. Both parts collectively include 107k SQA pairs that cover various topics. Given the evident paucity of existing speech-text LLMs, we propose a lightweight, end-to-end framework to execute the SQA task on the LibriSQA, witnessing significant results. By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks. Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.

4/19/2024

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

Yoonsik Kim, Moonbin Yim, Ka Yeon Song

In this paper, we establish a benchmark for table visual question answering, referred to as the TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA. As such, the primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either through the application of a stylesheet or by employing the proposed table rendering system. QA pairs are generated by exploiting the large language model (LLM) where the input is a text-formatted table. Ultimately, the completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. GPT-4V achieves the highest accuracy among commercial and open-sourced MLLMs from our experiments. Moreover, we discover that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs in comparison to their LLM backbones, we investigate by presenting image-formatted tables to MLLMs and text-formatted tables to LLMs, respectively. Our findings suggest that processing visual inputs is more challenging than text inputs, as evidenced by the lower performance of MLLMs, despite generally requiring higher computational costs than LLMs. The proposed TableVQA-Bench and evaluation codes are available at https://github.com/naver-ai/tablevqabench.

5/1/2024