ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Read original: arXiv:2402.04615 - Published 7/8/2024 by Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

Overview

The paper introduces ScreenAI, a vision-language model that specializes in understanding user interfaces (UIs) and infographics.
ScreenAI builds upon the PaLI architecture and incorporates the flexible patching strategy of pix2struct.
The model is trained on a unique mixture of datasets, including a novel screen annotation task that identifies the type and location of UI elements.
The text annotations from this task are used to describe screens to large language models, which then automatically generate QA, UI navigation, and summarization training datasets at scale.

Plain English Explanation

The paper discusses ScreenAI, a new AI model that is specifically designed to understand and work with user interfaces (UIs) and infographics. These visual elements, which share similar design principles, play an important role in how humans communicate and interact with machines.

ScreenAI is built on top of an existing model called PaLI, enhanced with the flexible "patching" strategy from pix2struct. That strategy splits an image into a grid of patches that follows the image's own shape, so screens and infographics with different aspect ratios can be processed without distortion. The researchers trained ScreenAI on a unique combination of datasets, including a novel task where the model has to identify the different types of UI elements (like buttons, menus, etc.) and where they are located on the screen.

By using these annotations to describe screens as text, the researchers were able to prompt large language models to automatically generate large training datasets. These datasets cover things like answering questions about the content, navigating through the UI, and summarizing the key information.

The end result is that ScreenAI, which is relatively small at only 5 billion parameters, achieves new state-of-the-art results on UI- and infographics-based benchmarks such as Multi-page DocVQA, WebSRC, MoTIF, and Widget Captioning. On Chart QA, DocVQA, and InfographicVQA it is best-in-class among models of similar size.

Technical Explanation

The key innovation in ScreenAI is the use of a novel "screen annotation" task during training. In this task, the model has to identify the type (e.g., button, menu, text field) and location of different UI elements on a screen.
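To make the screen annotation task concrete, here is a minimal Python sketch of how a screen might be represented and flattened into text. The `UIElement` class, the element labels, the 0-999 coordinate bins, and the output format are all illustrative assumptions, not the schema used in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UIElement:
    """One annotated element. The label set is illustrative, not the paper's."""
    kind: str                        # e.g. "BUTTON", "TEXT"
    text: str                        # visible text, may be empty
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def normalize_box(box, width, height, bins=1000):
    """Map pixel coordinates to integer bins in [0, bins - 1] (assumed scheme)."""
    left, top, right, bottom = box

    def scale(value, size):
        return min(bins - 1, max(0, value * bins // size))

    return (scale(left, width), scale(top, height),
            scale(right, width), scale(bottom, height))


def serialize_screen(elements: List[UIElement], width: int, height: int) -> str:
    """Flatten a screen into one line of text per element, in reading order."""
    # Reading order here means top-to-bottom, then left-to-right.
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    lines = []
    for element in ordered:
        left, top, right, bottom = normalize_box(element.box, width, height)
        label = f"{element.kind} {element.text}".strip()
        lines.append(f"{label} {left} {top} {right} {bottom}")
    return "\n".join(lines)
```

A model trained on such a task would take the screenshot as input and produce a string of this kind as output, which is what makes the annotation usable as plain text downstream.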

The researchers used the text annotations from this task to describe screens to large language models, which then automatically generated large-scale question-answering, UI navigation, and summarization datasets for training. This allowed ScreenAI to learn to understand UIs and infographics in a more targeted way than general-purpose vision-language models.
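The dataset generation step can be pictured as a prompt-and-parse loop. The sketch below is only an illustration under stated assumptions: the prompt wording, the JSON reply format, and the function names `build_qa_prompt` and `parse_qa_pairs` are hypothetical, and the call to an actual language model is deliberately left out.

```python
import json
from typing import Dict, List


def build_qa_prompt(screen_description: str, num_pairs: int = 3) -> str:
    """Wrap a textual screen description in a hypothetical QA-generation prompt."""
    return (
        "Below is a textual description of a screenshot, one UI element per line.\n"
        f"Write {num_pairs} question-answer pairs that can be answered from it.\n"
        'Reply with a JSON list of objects with "question" and "answer" keys.\n\n'
        f"Screen:\n{screen_description}\n"
    )


def parse_qa_pairs(llm_response: str) -> List[Dict[str, str]]:
    """Keep only well-formed pairs; malformed replies yield an empty list."""
    try:
        data = json.loads(llm_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        question = str(item.get("question", "")).strip()
        answer = str(item.get("answer", "")).strip()
        if question and answer:
            pairs.append({"question": question, "answer": answer})
    return pairs
```

Filtering the model's reply before it enters a training set is a design choice of this sketch; the source does not describe how generated samples were validated.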

ScreenAI builds upon the PaLI architecture, which combines computer vision and natural language processing capabilities. The researchers added the flexible "patching" strategy from the pix2struct model, which chooses the patch grid so that the aspect ratio of the input image is preserved. This lets the system handle screenshots and infographics of widely varying shapes.
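As a rough illustration of aspect-ratio-preserving patching, the sketch below picks a patch grid under a fixed patch budget. It follows the general idea behind pix2struct; the function name, the 16-pixel patch size, and the 1024-patch budget are assumptions chosen for illustration, not values taken from the ScreenAI paper.

```python
import math
from typing import Tuple


def flexible_patch_grid(image_height: int, image_width: int,
                        patch_size: int = 16,
                        max_patches: int = 1024) -> Tuple[int, int]:
    """Return (rows, cols) of a patch grid that keeps the image's aspect ratio.

    The image is conceptually rescaled so that rows * cols fits within
    max_patches, instead of being squashed into a fixed square grid.
    """
    # Scale factor that would make the rescaled image fill the patch budget.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


# A tall phone screenshot gets more rows than columns.
rows, cols = flexible_patch_grid(image_height=800, image_width=400)
print(rows, cols, rows * cols)
```

With a fixed square grid, the same tall screenshot would be stretched horizontally before patching; here the grid itself takes the shape of the image.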

Through ablation studies, the researchers demonstrated the importance of their training data mixture and architectural choices. The result is that ScreenAI, despite being a relatively small model at 5 billion parameters, sets new state-of-the-art results on several UI- and infographics-focused benchmarks and is best-in-class among similarly sized models on others.

Critical Analysis

The researchers provide a thorough evaluation of ScreenAI, including comparisons to other state-of-the-art models. They highlight the model's state-of-the-art performance on specialized tasks like Widget Captioning, as well as best-in-class results among similarly sized models on more general benchmarks like Chart QA and DocVQA.

However, the paper does not delve into the potential limitations or failure cases of ScreenAI. It would be helpful to understand the types of UI or infographic elements that the model struggles with, or any biases or inconsistencies in its performance. Additionally, the paper does not discuss potential privacy or security concerns that could arise from using such a powerful UI-understanding model in real-world applications.

Further research could explore how ScreenAI's capabilities could be extended to other domains, such as mobile app development, data visualization, or even assistive technologies for users with disabilities. Investigating the model's robustness to adversarial attacks or its ability to generalize to new UI paradigms would also be valuable.

Conclusion

The ScreenAI model represents a significant advance in the field of vision-language understanding, with a particular focus on user interfaces and infographics. By incorporating a novel screen annotation task and leveraging the flexible patching strategy of pix2struct, the researchers have created a 5-billion-parameter model that reaches state-of-the-art or best-in-class results for its size on a variety of specialized benchmarks.

The ability to automatically generate large-scale training datasets with the help of large language models is a particularly notable contribution, as it opens up new possibilities for developing more intelligent and user-friendly human-machine interaction systems. The authors also release three new datasets, one for the screen annotation task and two for question answering. As the use of visual interfaces continues to grow, tools like ScreenAI will become increasingly important for bridging the gap between human communication and machine understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

7/8/2024

On AI-Inspired UI-Design

Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Gérard Dray, Walid Maalej

Graphical User Interface (or simply UI) is a primary means of interaction between users and their devices. In this paper, we discuss three major complementary approaches on how to use Artificial Intelligence (AI) to support app designers in creating better, more diverse, and creative UIs for mobile apps. First, designers can prompt a Large Language Model (LLM) like GPT to directly generate and adjust one or multiple UIs. Second, a Vision-Language Model (VLM) enables designers to effectively search a large screenshot dataset, e.g. from apps published in app stores. The third approach is to train a Diffusion Model (DM) specifically designed to generate app UIs as inspirational images. We discuss how AI should be used, in general, to inspire and assist creative app design rather than automating it.

6/21/2024

ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Victor Carbune, Jason Lin, Maria Wang, Srinivas Sunkara, Yun Zhu, Jindong Chen

We present a new benchmark and dataset, ScreenQA, for screen content understanding via question answering. The existing screen datasets are focused either on structure and component-level understanding, or on a much higher-level composite task such as navigation and task completion. We attempt to bridge the gap between these two by annotating 86K question-answer pairs over the RICO dataset in hope to benchmark the screen reading comprehension capacity. This work is also the first to annotate answers for different application scenarios, including both full sentences and short forms, as well as supporting UI contents on screen and their bounding boxes. With the rich annotation, we discuss and define the evaluation metrics of the benchmark, show applications of the dataset, and provide a few baselines using closed and open source models.

7/31/2024

Tell Me What's Next: Textual Foresight for Generic UI Representations

Andrea Burns, Kate Saenko, Bryan A. Plummer

Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To combat this, we propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. Our approach requires joint reasoning over elements and entire screens, resulting in improved UI features: on generation tasks, UI agents trained with Textual Foresight outperform state-of-the-art by 2% with 28x fewer images. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning. OpenApp enables new baselines, and we find Textual Foresight improves average task performance over them by 5.7% while having access to 2x less data.

6/13/2024