Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction

Read original: arXiv:2305.08144 - Published 6/14/2024 by Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, and Kai Yu

Overview

This paper introduces Mobile-Env, a toolkit for building benchmarks that evaluate how well agents based on large language models (LLMs) and vision-language models (VLMs) interact with the graphical user interface (GUI) of Android apps.
The authors argue that current benchmarks often fail to provide trustworthy and reproducible evaluations of GUI agents, and Mobile-Env offers an isolated and controlled environment to address this challenge.
Using Mobile-Env, the authors collect two task sets: an open-world set spanning various real-world mobile apps, and a fixed-world set, WikiHow, whose online content is captured so that evaluation is fully controllable and reproducible. They use both to assess advanced models such as GPT-4V and LLaMA-3.
The results reveal that even these state-of-the-art models struggle with relatively simple tasks that are easy for humans, highlighting a crucial gap in current AI capabilities and the need for more effective GUI agent frameworks.

Plain English Explanation

The Graphical User Interface (GUI) is the primary way that humans interact with digital devices and applications. Recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs) have opened up the possibility of creating advanced AI agents that can interact with GUIs, much like a human would.

To ensure these AI agents are effective, the researchers developed a new tool called Mobile-Env that provides a reliable and standardized way to test their performance. Mobile-Env creates a controlled environment to evaluate how well these AI models can complete tasks within mobile apps, such as navigating menus, finding information, or carrying out specific actions.

Using Mobile-Env, the researchers collected a diverse set of real-world tasks from various mobile apps, as well as a fixed set of tasks based on WikiHow. WikiHow's online content is dynamic, so the fixed set captures a large amount of that content, which keeps the evaluation controllable and reproducible from run to run. They then tested some of the most advanced AI models, like GPT-4V and LLaMA-3, to see how well they could handle these GUI-based tasks.

Surprisingly, the results showed that even these state-of-the-art models struggled with relatively simple tasks that would be easy for a human to complete. This reveals a significant gap between the capabilities of current AI systems and the flexibility and adaptability of human intelligence when it comes to interacting with digital interfaces. The researchers believe this underscores the importance of developing more capable foundation models and more effective frameworks for building AI agents that can truly excel at GUI-based tasks.

Technical Explanation

The paper introduces Mobile-Env, a comprehensive toolkit designed to create reliable and reproducible benchmarks for evaluating the performance of LLMs and VLMs in interacting with mobile app GUIs. The authors argue that current benchmarks often fail to provide trustworthy evaluations, as they lack the necessary isolation and control to ensure consistent and comparable results.

Mobile-Env offers an isolated and controllable setting for GUI agent evaluations on Android. It supports intermediate instructions and rewards, so a task can guide and score an agent partway through rather than only at the final outcome, which reflects real-world usage more naturally. Using this framework, the researchers collected two task sets: an "open-world" set covering various real-world mobile apps, and a "fixed-world" set, WikiHow, which captures a significant amount of dynamic online content so that evaluation is fully controllable and reproducible.
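To make the idea of intermediate instructions and rewards concrete, here is a minimal Python sketch. It is not Mobile-Env's actual API: the `Milestone` and `Task` classes, the `run_episode` helper, the text-based screen descriptions, and the reward values are all hypothetical, chosen only to illustrate how a task might reveal instructions step by step and pay out partial rewards.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Milestone:
    """Hypothetical intermediate checkpoint inside a task (not a Mobile-Env type)."""
    instruction: str                  # shown to the agent while this milestone is pending
    reached: Callable[[str], bool]    # predicate over a textual screen description
    reward: float                     # reward paid once when the milestone is reached


class Task:
    """Ordered list of milestones; the agent sees one instruction at a time."""

    def __init__(self, milestones: List[Milestone]) -> None:
        self.milestones = list(milestones)
        self.next_index = 0

    @property
    def done(self) -> bool:
        return self.next_index >= len(self.milestones)

    def current_instruction(self) -> Optional[str]:
        """Instruction for the pending milestone, or None when the task is finished."""
        return None if self.done else self.milestones[self.next_index].instruction

    def step(self, screen: str) -> float:
        """Check one observed screen; advance at most one milestone and return its reward."""
        if self.done:
            return 0.0
        milestone = self.milestones[self.next_index]
        if milestone.reached(screen):
            self.next_index += 1
            return milestone.reward
        return 0.0


def run_episode(task: Task, screens: List[str]) -> float:
    """Feed a sequence of observed screens to the task and sum the rewards."""
    total = 0.0
    for screen in screens:
        total += task.step(screen)
        if task.done:
            break
    return total


def build_demo_task() -> Task:
    """Toy two-step task; the screen strings and rewards are made up for illustration."""
    return Task([
        Milestone(
            instruction="Search for 'bake bread'.",
            reached=lambda s: s.startswith("search results"),
            reward=0.5,
        ),
        Milestone(
            instruction="Open the article 'How to Bake Bread'.",
            reached=lambda s: s.startswith("article: How to Bake Bread"),
            reward=1.0,
        ),
    ])
```

With this design, an agent that reaches the search results but never opens the article still earns the first reward, so the score distinguishes partial progress from complete failure.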

The authors then conducted comprehensive evaluations of advanced LLM agents, including GPT-4V and LLaMA-3, on these benchmarks. The results reveal that even these state-of-the-art models struggle with tasks that are relatively simple for humans, highlighting a crucial gap in current AI capabilities. This underscores the importance of developing more capable foundation models and more effective GUI agent frameworks to bridge the divide between human and machine abilities in interacting with digital interfaces.

Critical Analysis

The researchers have made a compelling case for the need to develop robust and reliable benchmarks for evaluating the performance of GUI agents, as current approaches often fall short in providing trustworthy and reproducible evaluations. The introduction of Mobile-Env is a step in the right direction, as it offers a controlled and isolated environment that can accommodate more nuanced task design and reward structures.

However, the paper does not provide a detailed analysis of the specific limitations of existing benchmarks, nor does it compare the performance of Mobile-Env-based evaluations to those conducted using other frameworks. It would be valuable to see a more comprehensive comparison to better understand the unique advantages and potential drawbacks of the proposed approach.

Additionally, the paper focuses primarily on the performance of LLM agents, but it does not delve into the specific architectural or training details that may contribute to their struggles with the GUI-based tasks. A deeper investigation into the underlying factors that hinder the models' abilities could provide valuable insights for future model development and training strategies.

Furthermore, the authors acknowledge that the collected task sets, while diverse, may not represent the full spectrum of real-world GUI interactions. Expanding the task sets and evaluating further state-of-the-art models, such as additional multimodal or task-specific agents, could strengthen the conclusions and provide a more holistic understanding of the current state of GUI agent capabilities.

Conclusion

The Graphical User Interface (GUI) is a crucial component of human-digital interaction, and the development of advanced Large Language Models (LLMs) and Vision Language Models (VLMs) has opened up new possibilities for creating AI agents that can interact with GUIs. However, the lack of reliable and reproducible benchmarks has hindered the assessment of these agents' capabilities.

Mobile-Env provides a valuable framework for creating comprehensive and controlled benchmarks to evaluate GUI agent performance. The authors' evaluations show that even state-of-the-art models like GPT-4V and LLaMA-3 struggle with relatively simple GUI-based tasks, which highlights a crucial gap in current AI capabilities. This underscores the importance of continued research and development to create more effective foundation models and GUI agent frameworks that can bridge the divide between human and machine abilities in the digital realm.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction

Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu

The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks. Recent progress in Large Language Models (LLMs) and Vision Language Models (VLMs) offers the chance to create advanced GUI agents. To ensure their effectiveness, there's a pressing need for qualified benchmarks that provide trustworthy and reproducible evaluations -- a challenge current benchmarks often fail to address. To tackle this issue, we introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment. Mobile-Env offers an isolated and controllable setting for reliable evaluations, and accommodates intermediate instructions and rewards to reflect real-world usage more naturally. Utilizing Mobile-Env, we collect an open-world task set across various real-world apps and a fixed world set, WikiHow, which captures a significant amount of dynamic online contents for fully controllable and reproducible evaluation. We conduct comprehensive evaluations of LLM agents using these benchmarks. Our findings reveal that even advanced models (e.g., GPT-4V and LLaMA-3) struggle with tasks that are relatively simple for humans. This highlights a crucial gap in current models and underscores the importance of developing more capable foundation models and more effective GUI agent frameworks.

6/14/2024

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen

Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphic User Interfaces (GUIs) and their potential to autonomously manage daily tasks. Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents, due to the inexhaustible states of apps and the vague definition of feasible action sequences. To address this challenge, we propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing. We initially define 100 tasks across 10 open-source apps, categorized by multiple levels of difficulty. Subsequently, we evaluate several existing mobile agents, including AppAgent and MobileAgent, to thoroughly and systematically compare their performance. All materials are accessible on our project webpage: https://MobileAgentBench.github.io, contributing to the advancement of both academic and industrial fields.

6/13/2024

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, Shuo Shang

With the remarkable advancements of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a scarcity of benchmarks available for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations imposes limitations to task evaluation. (2) Specific instructions within a singular application lack adequacy for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics are insufficient to accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to accelerate the efficiency of task completion. Subsequently, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capabilities for mobile agents, our data is categorized into three distinct groups: SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. Furthermore, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.

7/2/2024

GUICourse: From General Vision Language Models to Versatile GUI Agents

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

6/18/2024