BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Read original: arXiv:2403.08140 - Published 6/11/2024 by Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, Kenton Lee

Overview

This paper presents a method called BAGEL for bootstrapping language model (LM) agents to perform actions in digital environments without human supervision.
LM agents often fail to generalize to new environments without human demonstrations. BAGEL addresses this by converting a seed set of randomly explored trajectories or synthetic instructions into demonstrations.
BAGEL uses two noisy LM components - an LM labeler that converts trajectories into synthetic instructions, and a zero-shot LM agent that maps those instructions back into refined trajectories. By iterating this process, BAGEL can transform the initial distribution of trajectories into ones that are well-described by natural language.
The paper then shows how these BAGEL-generated demonstrations can be used to adapt a zero-shot LM agent at test time, yielding improvements of 2-13% absolute on the ToolQA and MiniWob++ benchmarks, with up to a 13x reduction in execution failures.

Plain English Explanation

Language models (LMs) like GPT-3 are good at understanding and generating human language, but they can struggle to take actions in digital environments such as web browsers or APIs. In particular, LM agents often fail to generalize to a new environment unless a human first demonstrates how tasks are done there.

To address this, the researchers developed a method called BAGEL that can bootstrap LM agents to perform tasks in new environments without needing human demonstrations. BAGEL starts from a seed set of randomly explored trajectories or synthetic instructions and iteratively refines them into demonstrations: pairs of an instruction and a trajectory that carries it out.

The key idea is to use two different LM components in a loop: one that can convert trajectories into instructions, and another that can map instructions back into refined trajectories. By passing the instructions and trajectories back and forth between these two components, BAGEL is able to gradually transform the initial set of trajectories into ones that are well-described by natural language.

The researchers then show that these BAGEL-generated demonstrations can be used to adapt a zero-shot LM agent at test time, improving performance by 2-13% absolute on the ToolQA and MiniWob++ benchmarks. This suggests that BAGEL could help LM agents generalize to new environments without relying on human-written demonstrations.

Technical Explanation

The paper presents a method called BAGEL (Bootstrapping Agents by Guiding Exploration with Language) for bootstrapping language model (LM) agents to perform actions in digital environments without human supervision.

BAGEL consists of two key components: an LM labeler and a zero-shot LM agent. The LM labeler takes a trajectory (a sequence of actions performed in the environment) and generates a synthetic natural language instruction that describes that trajectory. The zero-shot LM agent, on the other hand, takes a natural language instruction and maps it to a refined trajectory.

BAGEL operates by iteratively running these two components in a loop. It starts with a seed set of randomly explored trajectories or synthetic instructions. The LM labeler then converts these trajectories into synthetic instructions, and the zero-shot LM agent maps those instructions back into refined trajectories. By repeating this process, BAGEL is able to transform the initial distribution of trajectories towards ones that are well-described by natural language.
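
The round-trip loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `round_trip`, `toy_labeler`, and `toy_agent` are hypothetical, the fixed number of rounds is an assumption, and the two toy functions are simple string manipulations standing in for the LM labeler and the zero-shot LM agent.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]                  # a sequence of action strings
Labeler = Callable[[Trajectory], str]   # trajectory -> synthetic instruction
Agent = Callable[[str], Trajectory]     # instruction -> refined trajectory


def round_trip(
    seed: Trajectory,
    labeler: Labeler,
    agent: Agent,
    num_rounds: int = 3,  # assumption: a fixed number of round-trips
) -> Tuple[str, Trajectory]:
    """Alternate labeler and agent, returning the final (instruction, trajectory)."""
    if num_rounds < 1:
        raise ValueError("num_rounds must be at least 1")
    trajectory = seed
    instruction = ""
    for _ in range(num_rounds):
        instruction = labeler(trajectory)   # describe the current trajectory
        trajectory = agent(instruction)     # re-execute the description
    return instruction, trajectory


def toy_labeler(trajectory: Trajectory) -> str:
    """Hypothetical stand-in for the LM labeler: ignores 'noop' actions."""
    meaningful = [a for a in trajectory if not a.startswith("noop")]
    return "Do: " + ", then ".join(meaningful)


def toy_agent(instruction: str) -> Trajectory:
    """Hypothetical stand-in for the zero-shot agent: parses steps back out."""
    body = instruction[len("Do: "):]
    return [step for step in body.split(", then ") if step]


if __name__ == "__main__":
    seed = ["click search", "noop scroll", "type bagel", "noop hover"]
    print(round_trip(seed, toy_labeler, toy_agent))
```

In this toy setting the random exploration steps (`noop ...`) are dropped after the first round-trip, which mirrors the idea that trajectories drift toward ones an instruction can describe. With real LM components both steps are noisy, so the result of each round is not guaranteed to improve.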

The paper then shows how these BAGEL-generated demonstrations can be used to adapt a zero-shot LM agent at test time. Specifically, the agent is adapted through in-context learning rather than fine-tuning: demonstrations relevant to the test instruction are retrieved from the BAGEL-generated set and placed in the agent's prompt. This yields improvements of 2-13% absolute on the ToolQA and MiniWob++ benchmarks, with up to a 13x reduction in execution failures.
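
Test-time adaptation over retrieved demonstrations might look roughly like the sketch below. The summary does not say which retriever or prompt format the paper uses, so both are assumptions here: `retrieve` ranks demonstrations by word-overlap (Jaccard) similarity as a simple stand-in, and `build_prompt` uses a made-up "Instruction / Actions" layout. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Demo = Tuple[str, List[str]]  # (instruction, list of actions)


def _tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens of a string."""
    return set(text.lower().split())


def retrieve(query: str, demos: List[Demo], k: int = 2) -> List[Demo]:
    """Return the k demos whose instructions best overlap with the query.

    Assumption: Jaccard word overlap stands in for the actual retriever.
    """
    q = _tokens(query)

    def score(demo: Demo) -> float:
        d = _tokens(demo[0])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(demos, key=score, reverse=True)[:k]


def build_prompt(query: str, demos: List[Demo], k: int = 2) -> str:
    """Place retrieved demos before the test instruction (in-context learning)."""
    parts = []
    for instruction, actions in retrieve(query, demos, k):
        parts.append(f"Instruction: {instruction}\nActions: {'; '.join(actions)}")
    parts.append(f"Instruction: {query}\nActions:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        ("book a flight to paris", ["open flights", "type paris"]),
        ("search for a recipe", ["click search", "type recipe"]),
        ("book a hotel in rome", ["open hotels", "type rome"]),
    ]
    print(build_prompt("book a flight to rome", demos))
```

The model's weights stay fixed throughout; only the prompt changes from one test instruction to the next.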

Critical Analysis

The BAGEL method presented in this paper is a promising approach for enabling language model agents to generalize beyond their training data and engage in more natural language-guided interactions. By leveraging a bootstrapping process that iteratively refines trajectories and instructions, BAGEL is able to generate demonstrations that are well-aligned with natural language.

However, the paper does not fully address the potential limitations and caveats of this approach. For example, the reliance on two noisy LM components (the labeler and the zero-shot agent) could introduce compounding errors over the course of the iterative refinement process. Additionally, the performance improvements demonstrated on the ToolQA and MiniWob++ benchmarks, while significant, may not necessarily translate to real-world environments with more complex and diverse tasks.

Further research is needed to explore the scalability and robustness of the BAGEL method, as well as its applicability to a wider range of digital environments and tasks. Investigating ways to improve the stability and accuracy of the LM components, or exploring alternative bootstrapping strategies, could be fruitful avenues for future work.

Conclusion

The BAGEL method presented in this paper is a step towards enabling language model agents to generalize to new digital environments and follow natural language instructions there. By bootstrapping demonstrations without human supervision, BAGEL can help bridge the gap between the language understanding capabilities of LMs and their ability to execute actions in environments such as web browsers and APIs.

While the results are promising, further research is needed to address the potential limitations and expand the scope of the approach. Nonetheless, the core ideas behind BAGEL, such as the iterative refinement of instructions and trajectories, could have broader implications for the field of language-guided AI systems and their potential to benefit society.

Related Papers

BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, Kenton Lee

Following natural language instructions by executing actions in digital environments (e.g. web-browsers and REST APIs) is a challenging task for language model (LM) agents. Unfortunately, LM agents often fail to generalize to new environments without human demonstrations. This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions, into demonstrations, via round-trips between two noisy LM components: an LM labeler which converts a trajectory into a synthetic instruction, and a zero-shot LM agent which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly converts the initial distribution of trajectories towards those that are well-described by natural language. We use BAGEL demonstrations to adapt a zero shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of over 2-13% absolute on ToolQA and MiniWob++, with up to 13x reduction in execution failures.

6/11/2024

Large Language Models Can Self-Improve At Web Agent Tasks

Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter

Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

5/31/2024

ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki

Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.

6/24/2024

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang, Jian Vora, Percy Liang, Jure Leskovec

A central aspect of machine learning research is experimentation, the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? To answer this question, we introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We then construct an agent that can perform ML experimentation based on ReAct framework. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. Our agents also display highly interpretable plans and actions. However, the success rates vary considerably; they span from 100% on well-established older datasets to as low as 0% on recent Kaggle challenges created potentially after the underlying LM was trained. Finally, we identify several key challenges for LM-based agents such as long-term planning and reducing hallucination. Our code is released at https://github.com/snap-stanford/MLAgentBench.

4/16/2024