WebCanvas: Benchmarking Web Agents in Online Environments

Read original: arXiv:2406.12373 - Published 7/17/2024 by Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu

Overview

This paper introduces WebCanvas, a novel evaluation framework for web agents that addresses the dynamic nature of web interactions.
WebCanvas includes a new evaluation metric, a benchmark dataset called Mind2Web-Live, and tools for collecting and maintaining the dataset.
The paper also presents an open-source agent framework with extensible modules for reasoning, enabling the community to conduct online inference and evaluations.

Plain English Explanation

Web agents are software programs that automate tasks on the web. To be practically useful, they must adapt to a web environment that changes constantly, with frequent updates to user interfaces and content. Most existing benchmarks, however, capture only the static aspects of the web, which makes it difficult to assess how well agents handle these changes.

To address this gap, the researchers developed WebCanvas, an online evaluation framework that enables more realistic assessments of web agents. The key components of WebCanvas include:

A new evaluation metric that can reliably capture the critical intermediate actions or states necessary for task completion, while ignoring insignificant events or changes to web elements.
A benchmark dataset called Mind2Web-Live, which is a refined version of the original Mind2Web static dataset and contains 542 tasks with 2,439 intermediate evaluation states.
Lightweight and generalizable annotation tools and testing pipelines that allow the community to collect and maintain a high-quality, up-to-date dataset.

The researchers also open-sourced an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations.

The best-performing agent developed using this framework achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. The researchers also analyze the performance discrepancies across various websites, domains, and experimental environments.
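The two headline numbers measure different things. As a minimal sketch, assume that the completion rate is the share of intermediate evaluation states an agent reaches and that a task counts as successful only when all of its states are reached. This is one reading of the summary above, not the paper's scoring code, and the `TaskResult` type and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task run: states reached vs. states defined."""
    reached: int
    total: int


def completion_rate(results: list[TaskResult]) -> float:
    """Share of all intermediate evaluation states that were reached."""
    total = sum(r.total for r in results)
    return sum(r.reached for r in results) / total if total else 0.0


def success_rate(results: list[TaskResult]) -> float:
    """Share of tasks in which every intermediate state was reached."""
    if not results:
        return 0.0
    return sum(r.reached == r.total for r in results) / len(results)


# Illustrative data only, not results from the paper.
runs = [TaskResult(4, 4), TaskResult(2, 5), TaskResult(0, 3), TaskResult(3, 4)]
print(f"completion rate: {completion_rate(runs):.4f}")
print(f"success rate:    {success_rate(runs):.4f}")
```

Under these assumptions a partially solved task raises the completion rate but not the success rate, which is consistent with the reported completion rate (48.8%) being well above the reported success rate (23.1%).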

Technical Explanation

The key technical aspects of the paper are as follows:

Evaluation Metric: The researchers developed a novel evaluation metric that can reliably capture the critical intermediate actions or states necessary for task completion, while ignoring noise caused by insignificant events or changes to web elements. This allows for a more accurate assessment of the agent's performance.
Benchmark Dataset: The Mind2Web-Live dataset is a refined version of the original Mind2Web static dataset, containing 542 tasks with 2,439 intermediate evaluation states, or 4.5 states per task on average. Because the tasks are evaluated online rather than against stored snapshots, the dataset supports a more realistic and dynamic assessment of web agents.
Annotation Tools and Testing Pipelines: The paper presents lightweight and generalizable annotation tools and testing pipelines that enable the community to collect and maintain a high-quality, up-to-date dataset for evaluating web agents. This allows for the continuous expansion and refinement of the benchmark.
Agent Framework: The researchers open-sourced an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. This framework can be used to develop and test web agents in a more realistic and dynamic environment.
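The idea behind the evaluation metric can be sketched in a few lines. This is a minimal illustration, assuming each intermediate evaluation state can be written as a check on the page URL or on the text of the element the agent acted on, and that states are reached in order. The `KeyNode` and `Step` types, the three match kinds, and the example task are all hypothetical and are not the paper's actual schema.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse


@dataclass(frozen=True)
class KeyNode:
    """Hypothetical intermediate state an agent must pass through."""
    kind: str        # "url_includes", "query_param_equals", "element_text_includes"
    target: str      # value to look for
    key: str = ""    # query parameter name, used by "query_param_equals" only


@dataclass(frozen=True)
class Step:
    """One observed agent step: the page URL and the text of the element used."""
    url: str
    element_text: str = ""


def matches(node: KeyNode, step: Step) -> bool:
    """Check a single step against a single key node."""
    if node.kind == "url_includes":
        return node.target in step.url
    if node.kind == "query_param_equals":
        params = parse_qs(urlparse(step.url).query)
        return node.target in params.get(node.key, [])
    if node.kind == "element_text_includes":
        return node.target.lower() in step.element_text.lower()
    raise ValueError(f"unknown key node kind: {node.kind}")


def score_trajectory(nodes: list[KeyNode], steps: list[Step]) -> int:
    """Count key nodes reached in order; steps that match nothing are ignored."""
    reached = 0
    for step in steps:
        # A single step may satisfy more than one consecutive key node.
        while reached < len(nodes) and matches(nodes[reached], step):
            reached += 1
    return reached


nodes = [
    KeyNode("url_includes", "example.com/flights"),
    KeyNode("query_param_equals", "NYC", key="from"),
    KeyNode("element_text_includes", "search"),
]
steps = [
    Step("https://example.com/"),                         # matches nothing
    Step("https://example.com/flights"),                  # reaches node 0
    Step("https://example.com/flights?promo=1"),          # noise, ignored
    Step("https://example.com/flights?from=NYC&ref=ad"),  # reaches node 1
    Step("https://example.com/flights?from=NYC", "Search flights"),  # node 2
]
print(score_trajectory(nodes, steps), "of", len(nodes), "key nodes reached")
```

Matching on a query parameter rather than the full URL is what lets the check ignore unrelated changes such as the extra `ref` parameter, which is the kind of noise the metric is meant to disregard.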

Critical Analysis

The paper presents a valuable contribution to web agent development and evaluation. By addressing the dynamic nature of the web, the researchers have taken an important step towards more practical and useful web agents. However, the best-performing agent reaches only a 23.1% task success rate and a 48.8% task completion rate, which suggests that there is still significant room for improvement.

One potential limitation of the research is the scope of the Mind2Web-Live dataset. While it represents a more realistic and dynamic environment compared to previous benchmarks, it may not capture the full breadth of challenges faced by web agents in the real world. The researchers acknowledge this and encourage the community to further expand and refine the dataset.

Additionally, the paper does not provide a detailed analysis of the specific challenges or bottlenecks that the web agents faced in the evaluation. Understanding these issues could help guide future research and development efforts.

Overall, the WebCanvas framework is a meaningful step forward for web agent research. Realizing the potential of web agents will depend on continued work to improve the agents themselves and to expand the available benchmarks as the web keeps changing.

Conclusion

This paper introduces WebCanvas, an innovative online evaluation framework for web agents that addresses the dynamic nature of web interactions. WebCanvas includes a novel evaluation metric, a refined benchmark dataset called Mind2Web-Live, and tools for collecting and maintaining high-quality, up-to-date datasets.

The researchers also open-sourced an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Although the best-performing agent achieved a task success rate of only 23.1% and a task completion rate of 48.8%, this work represents an important step forward in the development of practical and useful web agents.

The paper encourages the community to contribute further insights on online agent evaluation and to help expand the available benchmarks. By addressing the challenges of the dynamic web environment directly, WebCanvas gives future web agent research a more realistic basis for measuring progress.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

WebCanvas: Benchmarking Web Agents in Online Environments

Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu

For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric that reliably captures critical intermediate actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) A benchmark dataset called Mind2Web-Live, a refined version of the original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enable the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.

7/17/2024

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

4/17/2024

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data are publicly available at https://jykoh.com/vwa.

6/7/2024

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, Alexandre Lacoste

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

7/24/2024