GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Read original: arXiv:2406.10819 - Published 6/18/2024 by Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li and 10 others

Overview

This paper introduces a new dataset called "GUI-World" for training and evaluating multimodal large language models (MLLMs) on tasks related to graphical user interfaces (GUIs).
The dataset contains over 100,000 annotated GUI images, along with associated natural language descriptions, offering a comprehensive resource for developing GUI-oriented AI agents.
The paper also presents benchmarks for various GUI-related tasks, including GUI understanding, generation, and interaction, to drive progress in this emerging field.

Plain English Explanation

The researchers have created a new dataset called "GUI-World" that can be used to train and test AI systems that interact with graphical user interfaces (GUIs), such as the windows and menus that we use on computers and smartphones. This dataset includes over 100,000 images of different GUIs, along with text descriptions of what each GUI does and how it can be used.

By providing this comprehensive dataset, the researchers hope to spur the development of more advanced AI agents that can understand, generate, and interact with GUIs in natural and intuitive ways. This could have important applications in areas like user interface design, productivity software, and even virtual assistants that can help people navigate complex computer systems.

The paper also presents a set of benchmark tasks that researchers can use to measure the performance of their AI models on various GUI-related abilities, such as identifying the different components of a GUI or generating natural language descriptions of how a GUI works. These benchmarks will help drive progress in this emerging field of "GUI-oriented AI."

Technical Explanation

The GUI-World dataset [<a href="https://aimodels.fyi/papers/arxiv/guicourse-from-general-vision-language-models-to">1</a>] contains over 100,000 annotated GUI images, along with associated natural language descriptions. It is intended as a resource for training and evaluating multimodal large language models (MLLMs) on tasks related to graphical user interfaces (GUIs).

The dataset covers a diverse range of GUI types, including desktop applications, mobile apps, and web interfaces. Each GUI image is annotated with bounding boxes and labels for the various GUI components, such as buttons, menus, and text fields. The natural language descriptions explain the purpose and functionality of each GUI, as well as how a user might interact with it.
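
To make the annotation structure described above concrete, here is a minimal sketch of how one annotated sample might be represented in memory. The class names, field names, and the example values are all hypothetical and chosen for illustration; they are not taken from the dataset's actual file format, which this summary does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ComponentAnnotation:
    """One labeled GUI component (hypothetical schema, not the dataset's real format)."""
    label: str                       # e.g. "button", "menu", "text_field"
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

    def area(self) -> int:
        """Pixel area of the bounding box; degenerate boxes count as 0."""
        x_min, y_min, x_max, y_max = self.bbox
        return max(0, x_max - x_min) * max(0, y_max - y_min)


@dataclass
class GuiSample:
    """A GUI image plus its description and component annotations (assumed layout)."""
    image_path: str
    gui_type: str      # assumed values: "desktop", "mobile", or "web"
    description: str   # natural language description of purpose and usage
    components: List[ComponentAnnotation] = field(default_factory=list)

    def components_by_label(self, label: str) -> List[ComponentAnnotation]:
        """Return every annotated component carrying the given label."""
        return [c for c in self.components if c.label == label]


# Illustrative sample; the path and coordinates are placeholders.
sample = GuiSample(
    image_path="screenshots/login.png",
    gui_type="web",
    description="A login form with username and password fields and a submit button.",
    components=[
        ComponentAnnotation("text_field", (40, 100, 360, 140)),
        ComponentAnnotation("text_field", (40, 160, 360, 200)),
        ComponentAnnotation("button", (40, 220, 160, 260)),
    ],
)

text_fields = sample.components_by_label("text_field")
button_area = sample.components_by_label("button")[0].area()
```

Keeping the component list separate from the free-text description mirrors the two kinds of annotation mentioned above: structured labels for the component-level tasks and prose for the description tasks.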

The researchers propose several benchmark tasks to drive progress in GUI-oriented AI [<a href="https://aimodels.fyi/papers/arxiv/you-only-look-at-screens-multimodal-chain">2</a>, <a href="https://aimodels.fyi/papers/arxiv/worldgpt-empowering-llm-as-multimodal-world-model">3</a>, <a href="https://aimodels.fyi/papers/arxiv/mmworld-towards-multi-discipline-multi-faceted-world">4</a>, <a href="https://aimodels.fyi/papers/arxiv/v-zen-efficient-gui-understanding-precise-grounding">5</a>]. These include GUI understanding (e.g., identifying GUI components and their relationships), GUI generation (e.g., creating natural language descriptions of GUIs), and GUI interaction (e.g., following step-by-step instructions to complete tasks in a GUI).
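
As a rough illustration of how results on these task families could be tallied, the sketch below computes per-task accuracy from a list of model predictions. Exact-match scoring after lowercasing is an assumption made here for simplicity; the paper's own metrics may differ, and the task names and example answers are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_task(results: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute exact-match accuracy for each task family.

    Each result is a (task, prediction, reference) triple. Matching ignores
    case and surrounding whitespace; this is an assumed scoring rule, not
    the benchmark's official metric.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task, prediction, reference in results:
        total[task] += 1
        if prediction.strip().lower() == reference.strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Placeholder predictions for two of the task families named above.
results = [
    ("understanding", "Button", "button"),
    ("understanding", "menu", "text field"),
    ("interaction", "click submit", "click submit"),
]
scores = accuracy_by_task(results)
```

Exact match is only reasonable for short, closed-form answers; free-form descriptions would need a softer comparison, such as human or model-based judging.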

By providing this comprehensive dataset and set of benchmarks, the researchers aim to accelerate the development of multimodal AI agents that can effectively perceive, understand, and interact with graphical user interfaces.

Critical Analysis

The GUI-World dataset represents a valuable contribution to the field of multimodal AI, as it addresses an important gap in the availability of high-quality datasets for GUI-related tasks. The dataset's diversity and scale, as well as the detailed annotations, make it a compelling resource for training and evaluating advanced AI models.

However, the paper does not provide extensive details on the dataset's composition, such as the distribution of GUI types, the range of GUI complexity, or the quality control measures used in the annotation process. Further information on these aspects would be helpful for researchers to better understand the dataset's strengths and limitations.

Additionally, while the proposed benchmark tasks are well-aligned with real-world GUI-related challenges, the paper could have discussed the potential practical implications and applications of this research in more depth. Exploring how these advances in GUI-oriented AI might impact areas like user interface design, productivity software, or virtual assistants would help contextualize the significance of this work.

Overall, despite these gaps, the paper provides a solid foundation for further research and development in this emerging area.

Conclusion

The GUI-World dataset and the accompanying benchmarks introduced in this paper represent an important step forward in the development of multimodal AI agents capable of effectively perceiving, understanding, and interacting with graphical user interfaces.

By providing a comprehensive dataset of over 100,000 annotated GUI images, along with natural language descriptions, the researchers have created a valuable resource for training and evaluating advanced AI models on a wide range of GUI-related tasks. The proposed benchmarks, covering areas such as GUI understanding, generation, and interaction, will help drive progress in this emerging field and unlock new possibilities for AI-powered applications that seamlessly integrate with user interfaces.

As the use of GUIs continues to proliferate across computing devices and software platforms, the ability of AI systems to efficiently and intuitively navigate these environments will become increasingly crucial. The GUI-World dataset and the research it enables have the potential to contribute significantly to the development of more capable, user-friendly, and accessible AI-powered tools and services.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the sparse GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.

6/18/2024

GUICourse: From General Vision Language Models to Versatile GUI Agents

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

6/18/2024

WorldGPT: Empowering LLM as Multimodal World Model

Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, Yueting Zhuang

World models are progressively being employed across diverse fields, extending from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions, and confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics through analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we have integrated it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. As for evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Conducting evaluations on WorldNet directly demonstrates WorldGPT's capability to accurately model state transition patterns, affirming its effectiveness in understanding and predicting the dynamics of complex scenarios. We further explore WorldGPT's emerging potential in serving as a world simulator, helping multimodal agents generalize to unfamiliar domains through efficiently synthesising multimodal instruction instances which are proved to be as reliable as authentic data for fine-tuning purposes. The project is available at https://github.com/DCDmllm/WorldGPT.

4/30/2024

You Only Look at Screens: Multimodal Chain-of-Action Agents

Zhuosheng Zhang, Aston Zhang

Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, most existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-GUI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique -- leveraging a series of intermediate previous action histories and future action plans -- to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-GUI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-GUI.

6/10/2024