AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Read original: arXiv:2405.14573 - Published 6/11/2024 by Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala and 5 others

Overview

Presents AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic task workflows across 20 real-world Android applications
Designed to enable testing of autonomous agents that can execute human tasks by controlling computers, with the goal of enhancing human productivity and application accessibility
Introduces a new computer control agent called M3A, which can complete 30.6% of AndroidWorld's tasks
Adapts a popular desktop web agent to Android and finds it less effective on mobile, which suggests that further research is needed to achieve universal, cross-domain agents
Conducts a robustness analysis to demonstrate that variations in task parameters can significantly alter the complexity of a task and an agent's performance

Plain English Explanation

AndroidWorld is a virtual Android environment that allows researchers to test how well autonomous agents can perform real-world tasks on mobile devices. Unlike existing interactive environments, which provide a static set of test tasks, AndroidWorld dynamically constructs tasks that are expressed in natural language and can be parameterized in unlimited ways. This enables testing on a much larger and more realistic suite of tasks.

The researchers introduced a new computer control agent called M3A, which can complete 30.6% of the tasks in AndroidWorld, just under a third. They also adapted a popular desktop web agent to work on Android, but found it less effective on mobile devices. This suggests that more research is needed to create agents that work across different domains, such as desktop and mobile.

To better understand the capabilities of these agents, the researchers conducted a "robustness analysis" by testing M3A on a range of task variations. They found that changes to the parameters of a task can significantly affect how difficult it is for an agent to complete. This highlights the importance of testing agents under diverse conditions, rather than just using a fixed set of tasks.

Overall, AndroidWorld provides a more realistic and flexible way to benchmark the progress of autonomous agents that can control computers to help humans. The research demonstrates that while progress is being made, there is still plenty of room for improvement to create agents that can truly be effective across a wide range of real-world tasks.

Technical Explanation

AndroidWorld is a fully functional Android environment that provides reward signals for 116 programmatic task workflows across 20 real-world Android applications. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways. This enables testing on a much larger and more realistic suite of tasks.
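To make the idea of dynamically constructed, parameterized tasks concrete, here is a minimal sketch in Python. It is not AndroidWorld's code: the template text, the name pool, `TaskInstance`, and `generate_task` are all hypothetical names invented for illustration. It only shows how one template plus a seeded random draw can yield many distinct natural-language goals.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One concrete task: a natural-language goal plus the parameters behind it."""
    goal: str
    params: dict


# Hypothetical template and value pool, invented for this sketch.
TEMPLATE = "Create a contact named {name} with phone number {phone}."
NAMES = ["Ada Park", "Luis Ortega", "Mei Tanaka", "Omar Haddad"]


def generate_task(seed: int) -> TaskInstance:
    """Draw task parameters from a seeded RNG and render the goal text."""
    rng = random.Random(seed)
    params = {
        "name": rng.choice(NAMES),
        "phone": "555-" + "".join(str(rng.randint(0, 9)) for _ in range(4)),
    }
    return TaskInstance(goal=TEMPLATE.format(**params), params=params)


if __name__ == "__main__":
    # Each seed produces a different instance of the same underlying task.
    for seed in range(3):
        print(generate_task(seed).goal)
```

Seeding the generator keeps each instance reproducible, so the same seed always yields the same goal while different seeds vary the parameters.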

The reward signals in AndroidWorld are derived from the computer's system state, making them durable across task variations and extensible across different apps. To demonstrate AndroidWorld's benefits and mode of operation, the researchers introduced a new computer control agent called M3A, which can complete 30.6% of AndroidWorld's tasks.
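A reward derived from system state can be sketched as a check on what the device contains after the agent finishes, rather than on the actions the agent took. The example below is an assumption-laden toy: `DeviceState` is a plain Python stand-in for real device state, and `is_successful` is a hypothetical checker, not a function from the AndroidWorld codebase.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceState:
    """Toy stand-in for device state: a contacts table mapping name to phone."""
    contacts: dict = field(default_factory=dict)


def is_successful(state: DeviceState, params: dict) -> float:
    """Return 1.0 if the final state holds the requested contact, else 0.0.

    Only the end state is inspected, not the sequence of actions behind it.
    """
    return 1.0 if state.contacts.get(params["name"]) == params["phone"] else 0.0


if __name__ == "__main__":
    params = {"name": "Ada Park", "phone": "555-0142"}
    state = DeviceState()
    print(is_successful(state, params))  # checked before any action is taken
    state.contacts["Ada Park"] = "555-0142"  # pretend an agent did this through the UI
    print(is_successful(state, params))  # checked after the state change
```

Because the checker reads its expected values from the task parameters, the same function keeps working when those parameters change, which is one way to read the claim that such rewards are durable across task variations.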

The researchers also adapted a popular desktop web agent to work on Android and found it to be less effective on mobile, suggesting the need for future research to achieve universal, cross-domain agents. To assess the robustness of M3A, the researchers conducted a robustness analysis, testing the agent against a range of task variations on a representative subset of tasks. They found that variations in task parameters can significantly alter the complexity of a task and therefore an agent's performance, highlighting the importance of testing agents under diverse conditions.
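The shape of such a robustness analysis can be sketched as running the same task under many parameter draws and reporting the fraction of successes. Everything below is hypothetical: `toy_agent_succeeds` is a deliberately simple stub whose outcome depends on a task parameter, standing in for a real agent, and the name pool is invented. It does not reproduce the paper's experimental protocol.

```python
import random
from statistics import mean

# Hypothetical parameter pool, invented for this sketch.
NAME_POOL = ["Ada", "Luis", "Bartholomew", "Mei", "Maximiliana"]


def sample_params(seed: int) -> dict:
    """Draw one task variation from a seeded RNG."""
    rng = random.Random(seed)
    return {"name": rng.choice(NAME_POOL)}


def toy_agent_succeeds(params: dict) -> bool:
    """Stub agent: fails whenever the name is longer than 8 characters.

    This makes success depend on the task parameters, which is the effect
    a robustness analysis is meant to expose.
    """
    return len(params["name"]) <= 8


def success_rate(seeds) -> float:
    """Fraction of task variations, one per seed, that the stub agent completes."""
    return mean(1.0 if toy_agent_succeeds(sample_params(s)) else 0.0 for s in seeds)


if __name__ == "__main__":
    print(f"success rate over 20 variations: {success_rate(range(20)):.2f}")
```

A single fixed instance of this task would report either 0% or 100% for the stub agent; sampling many variations shows the intermediate picture.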

Critical Analysis

The AndroidWorld environment presented in this paper is a significant step forward in benchmarking the capabilities of autonomous agents that can control computers to perform human tasks. By dynamically generating a wide variety of parameterized tasks expressed in natural language, AndroidWorld provides a much more realistic and challenging test bed compared to previous interactive environments with static task sets.

However, the paper does acknowledge some limitations. The fact that the M3A agent could only complete 30.6% of AndroidWorld's tasks suggests there is still substantial room for improvement in developing agents with truly general and robust capabilities. The authors' finding that adapting a desktop web agent to work on mobile was less effective also highlights the difficulty in creating agents that can seamlessly operate across different domains.

Additionally, the robustness analysis revealed that variations in task parameters can greatly impact an agent's performance. This is an important insight, as it suggests that testing agents on a diverse set of conditions is crucial to accurately assess their capabilities. Relying on a fixed test set may not provide a complete picture of an agent's true abilities.

Future research could explore ways to make AndroidWorld even more representative of real-world scenarios, such as by incorporating additional types of applications or task workflows. Investigating the specific reasons why certain task variations are more challenging for agents could also yield valuable insights to guide further advancements in this field.

Overall, the AndroidWorld environment and the experiments presented in this paper represent an important step forward in the development of autonomous agents that can enhance human productivity and accessibility. However, the research also highlights the significant challenges that remain in creating truly capable and versatile agents that can reliably perform a wide range of tasks across different domains.

Conclusion

The AndroidWorld environment introduced in this paper provides a more realistic and flexible benchmark for testing autonomous agents that can control computers to perform human tasks. By dynamically generating a diverse set of parameterized tasks expressed in natural language, AndroidWorld enables a much more comprehensive assessment of an agent's capabilities compared to previous interactive environments with static test sets.

The experiments conducted by the researchers, including the introduction of the M3A agent and the adaptation of a desktop web agent to mobile, demonstrate both the progress and the limitations of current approaches. M3A completes 30.6% of AndroidWorld's tasks, which leaves ample room for improvement, and the challenges of cross-domain agent development are evident.

The robustness analysis, which found that variations in task parameters can significantly impact an agent's performance, underscores the importance of testing agents under diverse conditions. Relying on a fixed set of tasks may not provide an accurate representation of an agent's true capabilities.

Overall, the AndroidWorld environment and the research presented in this paper represent an important step forward in the quest to develop autonomous agents that can enhance human productivity and application accessibility. However, the work also highlights the substantial challenges that remain in creating truly capable and versatile agents that can reliably perform a wide range of tasks across different domains. Continued research and innovation in this field will be crucial to realizing the full potential of human-agent collaboration.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva

Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. Reward signals are derived from the computer's system state, making them durable across task variations and extensible across different apps. To demonstrate AndroidWorld's benefits and mode of operation, we introduce a new computer control agent, M3A. M3A can complete 30.6% of the AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-domain agents. Finally, we conduct a robustness analysis by testing M3A against a range of task variations on a representative subset of tasks, demonstrating that variations in task parameters can significantly alter a task's complexity and, consequently, an agent's performance, highlighting the importance of testing agents under diverse conditions. AndroidWorld and the experiments in this paper are available at https://github.com/google-research/android_world.

6/11/2024

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.

7/29/2024

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.

5/31/2024

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

4/17/2024