AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Read original: arXiv:2407.18901 - Published 7/29/2024 by Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian

Overview

The paper describes the development of a high-quality execution environment called AppWorld Engine and a benchmark suite called AppWorld Benchmark to assess the capabilities of autonomous agents that perform day-to-day digital tasks.
Existing benchmarks are inadequate as they only cover tasks that require a simple sequence of API calls, while real-world tasks often require complex code generation and interaction with multiple applications.
The AppWorld Engine simulates the lives of ~100 fictitious users and their digital activities across 9 common apps, accessible via 457 APIs.
The AppWorld Benchmark contains 750 diverse and challenging tasks that require rich and interactive code generation by autonomous agents.

Plain English Explanation

Imagine you have a digital assistant that can help you with your everyday tasks, like ordering groceries, managing your calendar, and sending messages. To do these tasks, the assistant needs to be able to use multiple apps and services, and generate complex code that can adapt to different situations.

However, current benchmarks for these kinds of digital assistants fall short. They focus on tasks that need only a simple sequence of steps, like calling an API to place an order, and do not test the assistant's ability to handle more complex, real-world situations.

To address this, the researchers created the AppWorld Engine and the AppWorld Benchmark. The AppWorld Engine is a detailed simulation of the digital lives of around 100 people, with 9 different apps and 457 APIs that the assistant can use. The AppWorld Benchmark then presents the assistant with 750 diverse and challenging tasks that require the assistant to generate complex code and adapt to different situations.

By testing digital assistants on the AppWorld Benchmark, the researchers can get a better idea of their true capabilities and limitations. The reported results show that even GPT-4o, the strongest model tested, solves only about 49% of the 'normal' tasks and about 30% of the 'challenge' tasks. This highlights the difficulty of the benchmark and the need for further advancements in interactive coding agents.

Technical Explanation

The researchers developed the AppWorld Engine, a high-quality execution environment with 60,000 lines of code, that simulates the digital lives of ~100 fictitious users across 9 common apps, accessible via 457 APIs. This provides a realistic and diverse set of digital activities for autonomous agents to interact with.
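To make the contrast with "a simple sequence of API calls" concrete, here is a minimal sketch of a task that needs control flow across two apps. The `NotesApp` and `ShoppingApp` classes, their methods, and the grocery data are hypothetical stand-ins invented for illustration; they are not AppWorld's actual APIs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NotesApp:
    """Hypothetical stand-in for a notes app (not AppWorld's real API)."""
    notes: dict = field(default_factory=dict)

    def get_note(self, title: str) -> list:
        # Return a copy so callers cannot mutate the stored note.
        return list(self.notes.get(title, []))


@dataclass
class ShoppingApp:
    """Hypothetical stand-in for a shopping app with a small catalog."""
    catalog: dict = field(default_factory=dict)
    cart: list = field(default_factory=list)

    def search(self, item: str) -> Optional[float]:
        # Price of the item, or None if the shop does not carry it.
        return self.catalog.get(item)

    def add_to_cart(self, item: str) -> None:
        self.cart.append(item)


def order_groceries(notes: NotesApp, shop: ShoppingApp, budget: float):
    """Add listed items to the cart within budget; report skipped items."""
    spent, skipped = 0.0, []
    for item in notes.get_note("grocery list"):
        price = shop.search(item)
        # Branch on what the environment returns: missing or too expensive.
        if price is None or spent + price > budget:
            skipped.append(item)
            continue
        shop.add_to_cart(item)
        spent += price
    return spent, skipped


notes = NotesApp(notes={"grocery list": ["milk", "eggs", "saffron", "bread"]})
shop = ShoppingApp(catalog={"milk": 2.5, "eggs": 3.0, "saffron": 40.0})
spent, skipped = order_groceries(notes, shop, budget=10.0)
print(shop.cart, spent, skipped)
```

The point of the sketch is that the loop and the branch depend on values read from the apps at run time, so a fixed list of calls written in advance would not be enough.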

They then created the AppWorld Benchmark, a suite of 750 natural, diverse, and challenging tasks that require autonomous agents to generate rich and interactive code. The benchmark supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes or collateral damage.
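The idea of state-based evaluation with a collateral-damage check can be sketched as follows. This is an illustrative toy under stated assumptions, not the benchmark's actual test harness: the flat dictionary state, the `diff_state` and `evaluate` helpers, and the sample data are all hypothetical.

```python
import copy


def diff_state(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every key whose value changed."""
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


def evaluate(before: dict, after: dict, expected: dict, allowed_keys: set):
    """Pass only if the expected final values hold and nothing else changed.

    Only the final state is inspected, so any sequence of actions that
    reaches it is accepted.
    """
    goal_met = all(after.get(k) == v for k, v in expected.items())
    # Any change outside the allowed keys counts as collateral damage.
    collateral = sorted(k for k in diff_state(before, after) if k not in allowed_keys)
    return goal_met and not collateral, collateral


# Hypothetical environment state before the agent acts.
before = {"cart": [], "notes": ["buy milk"], "contacts": ["alice", "bob"]}

# Agent A completes the task cleanly.
after_a = copy.deepcopy(before)
after_a["cart"] = ["milk"]

# Agent B completes the task but also deletes a contact.
after_b = copy.deepcopy(before)
after_b["cart"] = ["milk"]
after_b["contacts"].remove("bob")

result_a = evaluate(before, after_a, {"cart": ["milk"]}, {"cart"})
result_b = evaluate(before, after_b, {"cart": ["milk"]}, {"cart"})
print(result_a, result_b)
```

Checking the end state rather than the exact call sequence is what lets different solution paths pass, while the diff against the starting state is what catches unintended side effects.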

The researchers evaluated the state-of-the-art language model, GPT-4o, on the AppWorld Benchmark and found that it could solve only ~49% of the 'normal' tasks and ~30% of the 'challenge' tasks. Other models performed worse, solving at least 16% fewer tasks. This highlights the difficulty of the benchmark and the need for further advancements in interactive coding agents that can handle complex, real-world tasks involving multiple applications and dynamic environments.

Critical Analysis

The researchers acknowledge that the AppWorld Benchmark is a significant challenge for current state-of-the-art language models, as it requires agents to generate rich and interactive code based on their interaction with a diverse set of digital applications and activities.

However, the paper does not explore the specific reasons why these models struggle with the benchmark tasks. It would be helpful to understand the particular capabilities or limitations of the models that lead to their relatively poor performance, as this could inform future research directions.

Additionally, the paper does not discuss the potential biases or limitations of the AppWorld Engine and AppWorld Benchmark themselves. It's possible that the simulation and tasks may not fully capture the complexity and diversity of real-world digital activities, which could affect the generalizability of the results.

Further research could also explore the potential applications and use cases of the AppWorld Engine and AppWorld Benchmark, beyond just evaluating autonomous agent capabilities. For example, they could be used to train and develop more robust interactive coding agents or to study human-AI collaboration in complex, multi-app digital environments.

Conclusion

The AppWorld Engine and AppWorld Benchmark represent a significant advancement in the field of autonomous agent evaluation, as they provide a more realistic and diverse set of digital tasks that require complex code generation and interaction with multiple applications.

The benchmark results highlight the limitations of current state-of-the-art language models in solving these types of real-world, interactive tasks, and suggest that further advancements in interactive coding agents are needed to address the challenges posed by the AppWorld Benchmark.

Overall, the AppWorld project has the potential to push the frontiers of autonomous agent research and development, and could lead to the creation of more capable and versatile digital assistants that can better support our day-to-day digital lives.


Related Papers

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household), must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.

7/29/2024

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva

Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. Reward signals are derived from the computer's system state, making them durable across task variations and extensible across different apps. To demonstrate AndroidWorld's benefits and mode of operation, we introduce a new computer control agent, M3A. M3A can complete 30.6% of the AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-domain agents. Finally, we conduct a robustness analysis by testing M3A against a range of task variations on a representative subset of tasks, demonstrating that variations in task parameters can significantly alter a task's complexity and, consequently, an agent's performance, highlighting the importance of testing agents under diverse conditions. AndroidWorld and the experiments in this paper are available at https://github.com/google-research/android_world.

6/11/2024

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.

5/31/2024

WebApp1K: A Practical Code-Generation Benchmark for Web App Development

Yi Cui

We introduce WebApp1K, a practical code-generation benchmark to measure LLM ability to develop web apps. This benchmark aims to calibrate LLM output and aid the models to progressively improve code correctness and functionality. The benchmark is lightweight and easy to run. We present the initial version of WebApp1K, and share our findings of running the benchmark against the latest frontier LLMs. First, open source LLMs deliver impressive performance, closely trailing behind GPT-4o and Claude 3.5. Second, model size has strong correlation with code correctness. Third, no prompting techniques have been found to lift performance either universally to all models, or significantly to a single model.

8/2/2024