Large Language Models Can Self-Improve At Web Agent Tasks

Read original: arXiv:2405.20309 - Published 5/31/2024 by Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter

Overview

Researchers explore how large language models (LLMs) can self-improve their performance as agents in complex environments like web browsers.
They use the WebArena benchmark to assess agent performance in web navigation and task completion.
The goal is to see if LLMs can fine-tune on their own generated data to exceed their base performance as autonomous agents.

Plain English Explanation

Training AI agents to effectively navigate and perform actions in complex environments like web browsers has traditionally been challenging due to limited training data. However, recent research has shown that large language models (LLMs) can demonstrate some ability to navigate novel environments using just natural language instructions as a guide.

Additionally, studies have found that LLMs can improve their own performance by fine-tuning on data generated by the model itself. In this work, the researchers explore whether LLMs can use this self-improvement capability to perform better as autonomous agents on complex, long-horizon tasks.

They use the WebArena benchmark as the environment, where an agent must navigate web pages and complete specified objectives. After exploring fine-tuning on three distinct synthetic training data mixtures, the researchers achieve a 31% improvement in task completion rate over the base model.

The researchers also contribute new evaluation metrics to assess the performance, robustness, and quality of the agent's trajectories in greater detail than just aggregate benchmark scores, providing a more comprehensive way to measure self-improvement.

Technical Explanation

The researchers investigate the extent to which large language models (LLMs) can self-improve their performance as autonomous agents in complex environments, specifically using the WebArena benchmark.

In WebArena, an agent must navigate web pages and perform actions to achieve a specified objective. The researchers explore fine-tuning the LLM on three distinct synthetic training data mixtures and evaluate the model's performance on the WebArena benchmark.
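The idea of fine-tuning on self-generated data can be sketched as a data-collection step: run the base agent on a set of tasks, keep the trajectories that pass some filter, and turn them into fine-tuning examples. This is only a minimal illustration under stated assumptions, not the paper's procedure. The `Trajectory` type, the `plausible` self-check flag, the `toy_agent` stand-in, and the prompt/completion record format are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    task: str            # natural-language objective given to the agent
    actions: List[str]   # actions the agent emitted, in order
    plausible: bool      # hypothetical self-check flag (assumption, not from the paper)


def collect_self_generated_data(
    tasks: List[str],
    run_agent: Callable[[str], Trajectory],
) -> List[Dict[str, str]]:
    """Run the agent on each task and keep only non-empty, plausible trajectories."""
    examples = []
    for task in tasks:
        traj = run_agent(task)
        if traj.plausible and traj.actions:
            # Hypothetical record format for a later supervised fine-tuning step.
            examples.append({"prompt": task, "completion": "\n".join(traj.actions)})
    return examples


def toy_agent(task: str) -> Trajectory:
    """Stand-in for an LLM agent: it only 'solves' tasks that mention search."""
    actions = ["type [search box] [query]", "stop"] if "search" in task else []
    return Trajectory(task=task, actions=actions, plausible=bool(actions))


dataset = collect_self_generated_data(["search for shoes", "post a comment"], toy_agent)
print(len(dataset), "example(s) kept for fine-tuning")
```

In a real setup, `run_agent` would drive an LLM in the browser environment, and the filtered records would feed a fine-tuning job. How the paper filters trajectories and builds its three data mixtures is not described in this summary.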

Through this self-improvement procedure, the researchers achieve a 31% improvement in task completion rate over the base model.

They also contribute novel evaluation metrics that assess the agent's performance, robustness, capabilities, and trajectory quality in more detail than the simple, aggregate-level benchmark scores currently used to measure self-improvement.
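To see why an aggregate score alone can hide changes in agent behavior, consider a small sketch that compares per-task outcomes between a base and a fine-tuned agent. The functions and the task data below are hypothetical illustrations, not the metrics defined in the paper. The sketch also assumes that an improvement figure such as 31% is relative to the base rate, which this summary does not specify.

```python
from typing import Dict


def relative_improvement(base_rate: float, tuned_rate: float) -> float:
    """Relative change in completion rate, e.g. 0.10 -> 0.13 is roughly +30%."""
    return (tuned_rate - base_rate) / base_rate


def compare_task_outcomes(base: Dict[str, bool], tuned: Dict[str, bool]) -> Dict[str, int]:
    """Count tasks gained, lost, and kept when moving from base to fine-tuned agent."""
    shared = base.keys() & tuned.keys()
    return {
        "gained": sum(1 for t in shared if tuned[t] and not base[t]),
        "lost": sum(1 for t in shared if base[t] and not tuned[t]),
        "kept": sum(1 for t in shared if base[t] and tuned[t]),
    }


# Hypothetical per-task success flags (True = task completed).
base_results = {"t1": True, "t2": False, "t3": False, "t4": True}
tuned_results = {"t1": True, "t2": True, "t3": False, "t4": False}

summary = compare_task_outcomes(base_results, tuned_results)
base_rate = sum(base_results.values()) / len(base_results)
tuned_rate = sum(tuned_results.values()) / len(tuned_results)

print(summary)
print(relative_improvement(base_rate, tuned_rate))
```

In this toy data both agents complete half of the tasks, so the aggregate rate is unchanged, yet one task was gained and one was lost. A breakdown like this is the kind of detail that a single benchmark score cannot show.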

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their work. They note that the synthetic training data used for fine-tuning may not fully capture the complexity and nuance of real-world web navigation, which could limit the agent's performance in more realistic scenarios.

Additionally, the researchers suggest that further work is needed to understand the generalization capabilities of the self-improved agents and how they might perform on a wider range of web-based tasks beyond the specific WebArena benchmark.

Existing research has also highlighted the challenge of maintaining coherence and logical reasoning in LLM-based agents as they navigate complex, long-horizon tasks. The researchers in this paper do not directly address this issue, which could be an area for further investigation.

Conclusion

This research demonstrates the potential for large language models (LLMs) to self-improve their performance as autonomous agents in complex environments, such as web navigation. By fine-tuning on synthetic training data, the researchers achieved a 31% improvement in task completion rate over the base model on the WebArena benchmark.

The introduction of novel evaluation metrics to assess agent performance, robustness, and trajectory quality provides a more comprehensive way to measure self-improvement, going beyond just aggregate-level benchmark scores.

These findings suggest that LLM-based agents could become increasingly capable of navigating and completing tasks in real-world, web-based environments, with potential applications in areas like web automation, content curation, and digital assistance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Large Language Models Can Self-Improve At Web Agent Tasks

Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter

Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

5/31/2024

Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning

Lucas-Andrei Thil, Mirela Popa, Gerasimos Spanakis

Recent advancements in language models have demonstrated remarkable improvements in various natural language processing (NLP) tasks such as web navigation. Supervised learning (SL) approaches have achieved impressive performance while utilizing significantly less training data compared to previous methods. However, these SL-based models fall short when compared to reinforcement learning (RL) approaches, which have shown superior results. In this paper, we propose a novel approach that combines SL and RL techniques over the MiniWoB benchmark to leverage the strengths of both methods. We also address a critical limitation in previous models' understanding of HTML content, revealing a tendency to memorize target elements rather than comprehend the underlying structure. To rectify this, we propose methods to enhance true understanding and present a new baseline of results. Our experiments demonstrate that our approach outperforms previous SL methods on certain tasks using less data and narrows the performance gap with RL models, achieving 43.58% average accuracy in SL and 36.69% when combined with a multimodal RL approach. This study sets a new direction for future web navigation and offers insights into the limitations and potential of language modeling for computer tasks.

5/2/2024

Exploring Autonomous Agents through the Lens of Large Language Models: A Review

Saikat Barua

Large Language Models (LLMs) are transforming artificial intelligence, enabling autonomous agents to perform diverse tasks across various domains. These agents, proficient in human-like text comprehension and generation, have the potential to revolutionize sectors from customer service to healthcare. However, they face challenges such as multimodality, human value alignment, hallucinations, and evaluation. Techniques like prompting, reasoning, tool utilization, and in-context learning are being explored to enhance their capabilities. Evaluation platforms like AgentBench, WebArena, and ToolLLM provide robust methods for assessing these agents in complex scenarios. These advancements are leading to the development of more resilient and capable autonomous agents, anticipated to become integral in our digital lives, assisting in tasks from email responses to disease diagnosis. The future of AI, with LLMs at the forefront, is promising.

4/9/2024

From Language Models to Practical Self-Improving Computer Agents

Alex Sheng

We develop a simple and straightforward methodology to create AI computer agents that can carry out diverse computer tasks and self-improve by developing tools and augmentations to enable themselves to solve increasingly complex tasks. As large language models (LLMs) have been shown to benefit from non-parametric augmentations, a significant body of recent work has focused on developing software that augments LLMs with various capabilities. Rather than manually developing static software to augment LLMs through human engineering effort, we propose that an LLM agent can systematically generate software to augment itself. We show, through a few case studies, that a minimal querying loop with appropriate prompt engineering allows an LLM to generate and use various augmentations, freely extending its own capabilities to carry out real-world computer tasks. Starting with only terminal access, we prompt an LLM agent to augment itself with retrieval, internet search, web navigation, and text editor capabilities. The agent effectively uses these various tools to solve problems including automated software development and web-based tasks.

4/19/2024