WebArena: A Realistic Web Environment for Building Autonomous Agents

Read original: arXiv:2307.13854 - Published 4/17/2024 by Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried and 2 others

Overview

This paper explores the potential for autonomous agents to manage daily tasks using natural language commands.
However, current agents are primarily tested in simplified synthetic environments, leading to a disconnect with real-world scenarios.
The researchers have built an environment called WebArena that is highly realistic and reproducible, focusing on agents that perform tasks on the web.
The environment includes fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management.
The environment is enriched with tools and external knowledge bases to encourage human-like task-solving.
The researchers have released a set of benchmark tasks to evaluate the functional correctness of task completions.
The tasks are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet.
Baseline agents, including those integrating recent techniques like reasoning before acting, were tested, but the results show that solving these complex tasks is challenging.

Plain English Explanation

As artificial intelligence (AI) models have become more advanced, there is now potential for AI-powered "agents" to help us with our daily tasks through natural language commands, similar to how we might ask a personal assistant for help. However, the current agents are primarily tested in simplified, artificial environments, which doesn't reflect the real-world complexity that humans face.

To address this, the researchers have created a new environment called WebArena that is much more realistic and reflects the kind of tasks we perform on the internet every day. It consists of fully functional websites across four domains: online shopping, discussion forums, collaborative software development, and content management. The environment also provides helpful tools and reference material, such as a map and user manuals, much as a human would have on hand when trying to complete a task.

Using this new environment, the researchers have developed a set of challenging tasks that mimic the kind of things we might need to do online, like finding a specific product, joining a discussion, or contributing to a software project. They've tested several different AI agents, including some that use advanced techniques like "reasoning before acting," to see how well they can complete these tasks.

The results show that even with these cutting-edge AI models, solving these complex, real-world tasks is still very difficult. The best-performing agent was only able to successfully complete the tasks about 14% of the time, compared to 78% for humans. This highlights that there is still a lot of work to be done to create AI agents that can truly understand and navigate the complexities of the internet and the real world.

Technical Explanation

The researchers have developed an environment called WebArena that aims to bridge the gap between current AI agents, which are primarily tested in simplified synthetic environments, and the real-world complexity of tasks performed on the internet.

The WebArena environment includes fully functional websites across four common domains: e-commerce, social forum discussions, collaborative software development, and content management. These websites are designed to be highly realistic and reproducible, providing a test bed for language-guided agents to perform a diverse set of long-horizon tasks.

To encourage human-like task-solving, the WebArena environment is enriched with various tools (e.g., a map) and external knowledge bases (e.g., user manuals) that agents can utilize. Building upon this environment, the researchers have released a set of benchmark tasks that emulate the types of tasks humans routinely perform on the internet, such as finding a specific product, joining a discussion, or contributing to a software project.
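Evaluating functional correctness means judging whether the task's goal was actually achieved, rather than whether the agent followed one particular sequence of clicks. The sketch below illustrates the idea on a toy shopping cart. Everything in it (`ShopState`, `add_to_cart`, `remove_from_cart`, `is_functionally_correct`, and the task itself) is a hypothetical stand-in invented for illustration; it is not the benchmark's actual evaluator or API.

```python
from dataclasses import dataclass, field


@dataclass
class ShopState:
    """Hypothetical stand-in for the state of an e-commerce site."""
    cart: dict[str, int] = field(default_factory=dict)


def add_to_cart(state: ShopState, item: str, qty: int = 1) -> None:
    state.cart[item] = state.cart.get(item, 0) + qty


def remove_from_cart(state: ShopState, item: str) -> None:
    state.cart.pop(item, None)


def is_functionally_correct(state: ShopState, expected_cart: dict[str, int]) -> bool:
    """Judge the outcome (the final state), not the path the agent took."""
    return state.cart == expected_cart


# Hypothetical task: "end up with exactly two units of 'usb-c cable' in the cart".
goal = {"usb-c cable": 2}

# Path 1: the shortest route.
direct = ShopState()
add_to_cart(direct, "usb-c cable", 2)

# Path 2: a detour that the agent later undoes.
roundabout = ShopState()
add_to_cart(roundabout, "hdmi cable")
add_to_cart(roundabout, "usb-c cable")
remove_from_cart(roundabout, "hdmi cable")
add_to_cart(roundabout, "usb-c cable")

# Path 3: the agent stops too early.
incomplete = ShopState()
add_to_cart(incomplete, "usb-c cable", 1)

results = {
    "direct": is_functionally_correct(direct, goal),
    "roundabout": is_functionally_correct(roundabout, goal),
    "incomplete": is_functionally_correct(incomplete, goal),
}
print(results)
```

Both the direct and the roundabout trajectories pass because they reach the same final state, while the incomplete one fails. Comparing action sequences against a single reference trajectory would instead penalize the valid detour.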

The researchers have experimented with several baseline agents, including those that integrate recent techniques such as reasoning before acting. However, the results demonstrate that solving these complex, real-world tasks is still a significant challenge for current state-of-the-art AI systems.
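"Reasoning before acting" broadly means that the agent writes out a short rationale before committing to each action. A minimal sketch of such a loop follows, under stated assumptions: the `THOUGHT:`/`ACTION:` response format, the action strings, and the scripted stub model are all hypothetical, and no browser or language model is involved. The paper's actual prompts and action space are not reproduced here.

```python
from typing import Callable

# A "model" here is any function from a prompt to a text response.
Model = Callable[[str], str]


def parse_response(response: str) -> tuple[str, str]:
    """Split a response of the (assumed) form 'THOUGHT: ... ACTION: ...'."""
    thought, sep, action = response.partition("ACTION:")
    if not sep:
        raise ValueError("response contains no ACTION: section")
    return thought.replace("THOUGHT:", "", 1).strip(), action.strip()


def run_episode(model: Model, task: str, max_steps: int = 5) -> list[tuple[str, str]]:
    """Ask the model for a thought and then an action at each step."""
    trace: list[tuple[str, str]] = []
    for _ in range(max_steps):
        history = "\n".join(f"{t} -> {a}" for t, a in trace)
        prompt = f"Task: {task}\nHistory:\n{history}\nReason first, then act."
        thought, action = parse_response(model(prompt))
        trace.append((thought, action))
        if action == "stop":  # hypothetical action that ends the episode
            break
    return trace


def scripted_model() -> Model:
    """Stub that replays canned responses; a real agent would call an LLM."""
    canned = iter([
        "THOUGHT: I need the search box first. ACTION: click search",
        "THOUGHT: Now I can enter the product name. ACTION: type 'usb-c cable'",
        "THOUGHT: The results are showing, so I am done. ACTION: stop",
    ])
    return lambda prompt: next(canned)


trace = run_episode(scripted_model(), "find a usb-c cable")
for thought, action in trace:
    print(f"{thought!r} -> {action!r}")
```

Keeping the thought and the action as separate fields makes it possible to log the rationale for later error analysis while only the action is passed on to the environment.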

The best-performing GPT-4-based agent achieved an end-to-end task success rate of only 14.41%, significantly lower than the human performance of 78.24%. These findings highlight the need for further development of robust agents that can reliably navigate and complete tasks in the complex, open-ended world of the internet.
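To put the two reported figures side by side, the short sketch below computes an end-to-end success rate from per-task outcomes and then the gap between the reported agent and human rates. The `success_rate` helper and the four example outcomes are illustrative assumptions; only the 14.41% and 78.24% figures come from the paper.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks completed end to end (one bool per task)."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up outcomes for four tasks, just to show the calculation.
example_rate = success_rate([True, False, False, False])

# Figures reported in the paper.
AGENT_RATE = 14.41  # best GPT-4-based agent, in percent
HUMAN_RATE = 78.24  # human performance, in percent

gap_points = round(HUMAN_RATE - AGENT_RATE, 2)  # absolute gap in percentage points
ratio = round(HUMAN_RATE / AGENT_RATE, 2)       # how many times higher the human rate is

print(f"example rate: {example_rate:.1f}%")
print(f"gap: {gap_points} percentage points, ratio: {ratio}x")
```

The absolute gap works out to 63.83 percentage points, which means humans succeed roughly 5.4 times as often as the best baseline agent.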

Critical Analysis

The researchers have made a valuable contribution by developing the WebArena environment, which provides a more realistic and reproducible testbed for evaluating language-guided agents in web-based tasks. This is an important step forward, as current agents are primarily tested in simplified synthetic environments that do not capture the full complexity of real-world scenarios.

However, the paper also highlights the significant challenges that current state-of-the-art AI systems face when tackling these complex, long-horizon tasks. The low success rate of the best-performing agent (14.41%) compared to human performance (78.24%) suggests that there is still a long way to go before AI can match human-level capabilities in these types of open-ended, real-world tasks.

The researchers acknowledge some of the limitations of their study, such as the potential for bias in the task design and the need for further research to understand the specific factors contributing to the performance gap between AI and humans. Additionally, the paper does not delve into the potential reasons why the baseline agents struggled, which could provide valuable insights for future research and development.

It would be interesting to see further analysis of the types of errors or failures the agents encountered, as well as the specific capabilities or limitations of the different techniques (e.g., reasoning before acting) that were integrated into the baseline agents. This could help guide the development of more robust and capable AI systems for these types of real-world tasks.

Conclusion

The research presented in this paper highlights the potential for autonomous agents to assist humans with daily tasks using natural language commands, but also underscores the significant challenges that current AI systems face when operating in complex, real-world environments like the internet.

By developing the WebArena environment and a set of diverse, long-horizon benchmark tasks, the researchers have provided a valuable tool for measuring the progress of language-guided agents in web-based scenarios. The low success rates of the baseline agents, even when integrating recent techniques, suggest that there is still much work to be done to create AI systems that can reliably and effectively navigate the complexities of the internet and the real world.

This research serves as a call to action for the AI research community, with the ultimate goal of developing agents that can truly assist and empower humans in their daily lives. The challenges uncovered in this paper, together with the benchmark itself, give future work a concrete target against which to measure progress.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

4/17/2024

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data are publicly available at https://jykoh.com/vwa.

6/7/2024

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, Alexandre Lacoste

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

7/24/2024

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.

7/23/2024