You Only Look at Screens: Multimodal Chain-of-Action Agents

Read original: arXiv:2309.11436 - Published 6/10/2024 by Zhuosheng Zhang, Aston Zhang


Overview

This paper introduces Auto-GUI, a multimodal solution that allows large language models (LLMs) to directly interact with graphical user interfaces (GUIs) without the need for external tools or application-specific APIs.
The paper proposes a chain-of-action technique that conditions each decision on the history of previously executed actions and on a plan of future actions, helping the agent decide what action to execute.
The approach is evaluated on a new device-control benchmark called AITW, which includes 30,000 unique instructions spanning multi-step tasks such as application operation, web searching, and web shopping.

Plain English Explanation

The paper presents a new way for AI models to interact with graphical user interfaces (GUIs) and automate tasks without manual intervention.

Traditionally, approaches that use large language models (LLMs) to interact with GUIs have relied on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. This can lead to inefficient inference and the risk of errors propagating through the system.

To address these challenges, the authors introduce Auto-GUI, a multimodal solution that allows the AI model to directly interact with the GUI, bypassing the need for environment parsing or reliance on application-dependent APIs. The key innovation is a "chain-of-action" technique that helps the model decide what action to take by considering a series of previous actions and future action plans.

The approach is evaluated on a new benchmark called AITW, which includes 30,000 unique instructions for tasks like operating applications, web searching, and web shopping. The results show that Auto-GUI achieves state-of-the-art performance, with an action type prediction accuracy of 90% and an overall action success rate of 74%.

Technical Explanation

The Auto-GUI framework interacts with the graphical user interface directly, with no environment parsing and no reliance on application-dependent APIs. This avoids the two costs the authors attribute to approaches built on external tools and APIs: inefficient inference and the risk of error propagation.

The core innovation is the chain-of-action technique, which helps the agent decide what action to take by conditioning on two things: the history of actions it has already executed and a plan of the actions still to come. This allows the agent to reason about the consequences of its actions and develop a more coherent strategy for accomplishing complex, multi-step tasks.
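As a rough illustration of the chain-of-action idea, the sketch below assembles the text side of one decision step: the goal plus the previous actions as input, and the future action plan plus the current action as the target. The `Action` class, the `build_input` and `build_target` helpers, the action labels, and the string layout are all hypothetical and are not taken from the paper or its code. The screen image, which the multimodal model also consumes, is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """One GUI action. The labels used here are illustrative assumptions."""
    action_type: str
    detail: str = ""

    def render(self) -> str:
        # "click(search bar)" when there is a detail, otherwise just "press_home".
        return f"{self.action_type}({self.detail})" if self.detail else self.action_type


def build_input(goal: str, history: List[Action], max_history: int = 8) -> str:
    """Text input for one step: the goal plus the most recent previous actions."""
    recent = history[-max_history:] if max_history > 0 else []
    past = ", ".join(a.render() for a in recent) or "none"
    return f"Goal: {goal}\nPrevious actions: {past}"


def build_target(plan: List[Action]) -> str:
    """Target for one step: the plan of remaining actions, then the action to run now."""
    if not plan:
        raise ValueError("plan must contain at least the current action")
    future = ", ".join(a.action_type for a in plan)
    return f"Action plan: [{future}]\nAction decision: {plan[0].render()}"


if __name__ == "__main__":
    history = [Action("click", "search bar"), Action("type", "weather")]
    plan = [Action("press_enter"), Action("scroll", "down"), Action("status_complete")]
    print(build_input("Check the weather", history))
    print(build_target(plan))
```

Capping the history with `max_history` is a design choice of this sketch: it keeps the input length bounded on long episodes, at the cost of forgetting early steps.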

The system is evaluated on the AITW benchmark, which includes 30,000 unique instructions across a diverse range of GUI-based tasks, such as application operation, web searching, and web shopping. The results show that Auto-GUI achieves state-of-the-art performance, with an action type prediction accuracy of 90% and an overall action success rate of 74%.
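The summary does not define the two reported metrics, so the following is only a minimal sketch of how such step-level scores could be computed. It assumes that action type accuracy counts steps where the predicted type equals the reference type, and that a step counts as successful when the type matches and its arguments also match. The dictionary keys, the action labels, and the `tolerance` parameter (a distance in normalized screen coordinates) are assumptions of this example, not the benchmark's official scoring rules.

```python
import math
from typing import Dict, List

Step = Dict[str, object]  # e.g. {"action_type": "click", "point": (0.5, 0.5)}


def type_accuracy(preds: List[Step], golds: List[Step]) -> float:
    """Fraction of steps whose predicted action type equals the reference type."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and the same length")
    hits = sum(p["action_type"] == g["action_type"] for p, g in zip(preds, golds))
    return hits / len(golds)


def step_matches(pred: Step, gold: Step, tolerance: float = 0.14) -> bool:
    """Assumed success rule: same type, and matching arguments where they exist."""
    if pred["action_type"] != gold["action_type"]:
        return False
    if gold["action_type"] == "click":
        (px, py), (gx, gy) = pred["point"], gold["point"]
        # Accept a click that lands close enough to the reference point.
        return math.hypot(px - gx, py - gy) <= tolerance
    if gold["action_type"] == "type":
        return pred.get("text") == gold.get("text")
    return True  # argument-free actions only need the type to match


def success_rate(preds: List[Step], golds: List[Step], tolerance: float = 0.14) -> float:
    """Fraction of steps that satisfy step_matches."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and the same length")
    return sum(step_matches(p, g, tolerance) for p, g in zip(preds, golds)) / len(golds)
```

Under these assumed definitions the success rate can never exceed the type accuracy, which is consistent with the reported 74% sitting below 90%: a step can get the action type right and still miss the click location or the typed text.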

Critical Analysis

The paper presents a promising approach for enabling AI models to directly interact with graphical user interfaces, but there are a few potential limitations and areas for further research:

The evaluation is limited to a single benchmark, AITW, which may not capture the full range of challenges and use cases in real-world GUI interactions. Further evaluation on a more diverse set of benchmarks would help validate the generalizability of the approach.
The paper does not provide a detailed analysis of the types of errors or failure modes encountered by the system. Understanding these failure cases could inform future improvements to the chain-of-action technique or the overall system design.
Input and output modalities beyond the screen itself (e.g., speech) could potentially enhance the system's capabilities and robustness, but this is not explored in the current work.

Overall, the Auto-GUI approach represents an important step forward in enabling AI agents to effectively interact with graphical user interfaces, with promising results that warrant further investigation and development.

Conclusion

The Auto-GUI framework introduces a novel solution for enabling large language models to directly interact with graphical user interfaces, bypassing the need for external tools and application-specific APIs. By leveraging a chain-of-action technique to reason about the consequences of its actions, the system can effectively automate complex, multi-step tasks in a diverse range of GUI-based environments.

The promising results on the AITW benchmark suggest that this approach could have significant implications for the development of autonomous agents capable of assisting humans with a wide variety of computer-based tasks. As the field of AI-powered user interfaces continues to evolve, the insights and techniques presented in this paper may pave the way for more intelligent and seamless interactions between humans and machines.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


You Only Look at Screens: Multimodal Chain-of-Action Agents

Zhuosheng Zhang, Aston Zhang

Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, most existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-GUI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique -- leveraging a series of intermediate previous action histories and future action plans -- to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-GUI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-GUI.

6/10/2024


CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

Xinbei Ma, Zhuosheng Zhang, Hai Zhao

Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments, especially for graphical user interface (GUI) automation. However, those GUI agents require comprehensive cognition ability including exhaustive perception and reliable action response. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve the GUI automation performance. First, CEP facilitates the GUI perception through different aspects and granularity, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes the action prediction into sub-problems: action type prediction and action target conditioned on the action type. With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios. Code is available at https://github.com/xbmxb/CoCo-Agent.

6/4/2024

GUICourse: From General Vision Language Models to Versatile GUI Agents

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

6/18/2024

GUI Action Narrator: Where and When Did That Action Take Place?

Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these challenges, we introduce our GUI action dataset Act2Cap as well as a simple yet effective framework, GUI Narrator, for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots. Specifically, a cursor detector is trained on our dataset, and a multimodal LLM model with mechanisms for selecting keyframes and key regions generates the captions. Experimental results indicate that even for today's most advanced multimodal models, such as GPT-4o, the task remains highly challenging. Additionally, our evaluations show that our strategy effectively enhances model performance, whether integrated into the fine-tuning of open-source models or employed as a prompting strategy in closed-source models.

6/21/2024