τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Read original: arXiv:2406.12045 - Published 6/19/2024 by Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan


Overview

Existing benchmarks do not test language agents on their ability to interact with human users or follow domain-specific rules, which are crucial for real-world deployments.
The authors propose a new benchmark called τ-Bench that simulates dynamic conversations between a user (represented by language models) and a language agent with access to domain-specific API tools and policy guidelines.
The evaluation process compares the final database state with the annotated goal state, and a new metric called "pass^k" is introduced to assess the reliability of agent behavior across multiple trials.
Experiments show that even state-of-the-art function-calling agents such as GPT-4o succeed on fewer than 50% of the tasks and behave inconsistently across repeated trials (pass^8 below 25% in the retail domain), highlighting the need for methods that improve agent consistency and rule-following.

Plain English Explanation

Existing benchmarks for language agents, such as chatbots or virtual assistants, often don't test how well they can interact with real human users or follow specific rules and guidelines required for particular domains, like retail or customer service. This is a problem because these abilities are crucial for deploying these agents in the real world.

To address this, the researchers created a new benchmark called τ-Bench that simulates conversations between a user (represented by a language model) and a language agent. The agent has access to domain-specific tools and policies, just like a real-world virtual assistant would.

The researchers use an efficient process to evaluate the agent's performance by comparing the final state of the "database" (representing the outcome of the conversation) to the expected or "annotated" goal state. They also introduce a new way to measure how consistently the agent behaves across multiple trials, called the "pass^k" metric.

Even the most advanced agents tested, such as GPT-4o, succeeded on fewer than 50% of the tasks and were inconsistent: an agent that solves a task once often fails the same task when it is run again. This suggests that much work remains before these agents can follow rules and guidelines consistently, which is essential for using them in real-world applications.

Technical Explanation

The paper introduces τ-Bench, a new benchmark for evaluating language agents' ability to interact with human users and follow domain-specific rules. Unlike existing benchmarks, which do not test these abilities, τ-Bench simulates dynamic conversations between a user (represented by a language model) and a language agent with access to domain-specific API tools and policy guidelines.
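The interaction described above can be pictured as a loop in which the user and the agent alternate turns and the agent's tool calls modify a database. The sketch below is a toy illustration only, not the benchmark's code: the scripted user stands in for the LM-simulated user, the rule-based agent stands in for the language agent, and `cancel_order`, the stop sentinel, and the "only pending orders can be cancelled" policy are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy database: order id -> order record.
Database = Dict[str, Dict[str, str]]

STOP = "###STOP###"  # hypothetical sentinel the user emits to end the conversation


def cancel_order(db: Database, order_id: str) -> str:
    """Hypothetical API tool that enforces an assumed policy before writing."""
    order = db.get(order_id)
    if order is None:
        return "error: order not found"
    if order["status"] != "pending":  # assumed policy rule
        return "error: only pending orders can be cancelled"
    order["status"] = "cancelled"
    return "ok"


TOOLS: Dict[str, Callable[[Database, str], str]] = {"cancel_order": cancel_order}


def scripted_user(turn: int) -> str:
    """Stand-in for the LM-simulated user: reveals the request over several turns."""
    script = ["Hi, I need help with an order.", "Please cancel order A1.", STOP]
    return script[min(turn, len(script) - 1)]


def rule_based_agent(message: str) -> Tuple[str, str, str]:
    """Stand-in for the language agent: returns (reply, tool_name, tool_argument)."""
    if "cancel order" in message:
        order_id = message.rstrip(".").split()[-1]
        return ("Cancelling that order now.", "cancel_order", order_id)
    return ("Sure, what would you like to do?", "", "")


def run_episode(db: Database, max_turns: int = 10) -> List[str]:
    """Alternate user and agent turns, applying any tool call to the database."""
    transcript: List[str] = []
    for turn in range(max_turns):
        user_msg = scripted_user(turn)
        if user_msg == STOP:
            break
        transcript.append(f"user: {user_msg}")
        reply, tool, arg = rule_based_agent(user_msg)
        if tool:
            result = TOOLS[tool](db, arg)
            transcript.append(f"tool[{tool}({arg})]: {result}")
        transcript.append(f"agent: {reply}")
    return transcript


db: Database = {"A1": {"status": "pending"}, "B2": {"status": "delivered"}}
for line in run_episode(db):
    print(line)
```

After the episode, the database itself records what the agent actually did, which is what makes an outcome-based check possible.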

The evaluation process compares the final database state at the end of the conversation to the annotated goal state, providing an efficient and faithful way to assess the agent's performance. The authors also propose a new metric called "pass^k" that evaluates the reliability of the agent's behavior over multiple trials, addressing the inconsistency often seen in language agents.
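A minimal sketch of the state-comparison idea, assuming the database can be represented as nested dictionaries: an episode counts as a success only if the final state equals the annotated goal state. The function names, the order records, and the diff helper are illustrative assumptions, not the benchmark's actual API.

```python
import copy
from typing import Any, Dict, Tuple

Database = Dict[str, Dict[str, Any]]


def episode_reward(final_db: Database, goal_db: Database) -> int:
    """Return 1 if the final database equals the annotated goal state, else 0."""
    return int(final_db == goal_db)


def diff_states(final_db: Database, goal_db: Database) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing record id to (final, goal), to help diagnose a failure."""
    keys = sorted(set(final_db) | set(goal_db))
    return {
        key: (final_db.get(key), goal_db.get(key))
        for key in keys
        if final_db.get(key) != goal_db.get(key)
    }


# Hypothetical task: the user wants order A1 cancelled and nothing else changed.
initial: Database = {"A1": {"status": "pending"}, "B2": {"status": "delivered"}}
goal = copy.deepcopy(initial)
goal["A1"]["status"] = "cancelled"

# Outcome 1: the agent performed exactly the requested change.
good_final = copy.deepcopy(initial)
good_final["A1"]["status"] = "cancelled"

# Outcome 2: the agent also modified an order it should not have touched.
bad_final = copy.deepcopy(good_final)
bad_final["B2"]["status"] = "returned"

print(episode_reward(good_final, goal))  # 1
print(episode_reward(bad_final, goal))   # 0
print(diff_states(bad_final, goal))
```

Comparing outcomes rather than transcripts means that any conversation path reaching the goal state passes, while unrequested side effects (the second outcome) fail even though the requested change was made.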

Experiments show that even state-of-the-art function-calling agents like GPT-4o succeed on fewer than 50% of the tasks in τ-Bench and have a pass^8 rate (the probability that all 8 independent trials of the same task succeed) of less than 25% in the retail domain. These findings highlight the need for methods that can improve agents' ability to act consistently and follow rules reliably, which is crucial for deploying them in real-world applications.
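To make the reliability metric concrete, here is a hedged sketch of how pass^k could be estimated from repeated trials. It assumes each task is run n times with c successes and uses the combinatorial estimator C(c, k) / C(n, k), averaged over tasks, which is the chance that k trials drawn without replacement from the n recorded ones all succeeded. The function names and the success counts are made up for illustration.

```python
from math import comb
from typing import Sequence


def task_pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the probability that k trials of one task all succeed."""
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie between 0 and num_trials")
    # comb(c, k) is 0 when k > c, so a task with too few successes contributes 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: Sequence[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [task_pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: 4 tasks, 8 trials each, with these success counts.
successes = [8, 6, 4, 0]
for k in (1, 2, 8):
    print(k, benchmark_pass_hat_k(successes, num_trials=8, k=k))
# k=1 -> 0.5625, k=2 -> 0.4375, k=8 -> 0.25
```

In this made-up example the agent passes about 56% of single trials, yet only the one task it solved on every attempt counts toward pass^8, so the score drops to 25%. This is how a moderate average success rate can coexist with low reliability.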

The τ-Bench benchmark builds on previous work in benchmarking language agents, evaluating large language models, and creating realistic web environments for autonomous agents, aiming to provide a more comprehensive and realistic assessment of language agents' capabilities.

Critical Analysis

The paper presents a valuable contribution by addressing the limitations of existing benchmarks and introducing a more realistic and domain-specific evaluation framework for language agents. The τ-Bench benchmark's focus on assessing an agent's ability to interact with users and follow rules is particularly important for real-world deployments, where these skills are essential.

However, the paper also acknowledges several caveats and limitations. The user simulation via language models may not fully capture the complexity and nuance of human interactions, and the domain-specific policies and APIs used in the benchmark may not be representative of all real-world applications. Additionally, the "pass^k" metric, while novel, may not provide a complete picture of an agent's overall performance and reliability.

Further research could explore ways to make the user simulation more sophisticated, incorporate a broader range of domain-specific scenarios, and develop complementary evaluation metrics to provide a more comprehensive assessment of language agents' capabilities. Investigating the underlying reasons for the inconsistent behavior observed in the experiments could also lead to insights for improving agent performance and reliability.

Conclusion

The τ-Bench benchmark proposed in this paper represents an important step forward in the objective evaluation of language agents' user-interaction and domain-specific rule-following capabilities. By simulating dynamic conversations between users and agents with access to domain-specific tools and policies, the benchmark provides a more realistic assessment of an agent's ability to interact effectively and follow rules consistently.

The experimental results, which show even state-of-the-art agents succeeding on fewer than half of the tasks, highlight the need for continued research and development to improve the reliability and rule-following abilities of these language agents. As these technologies become more prevalent in various applications, ensuring they can perform consistently and adhere to domain-specific guidelines will be crucial for their successful deployment and adoption.



Related Papers


τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose τ-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.

6/19/2024

GTA: A Benchmark for General Tool Agents

Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, Xinyi Le

Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.

7/12/2024

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang

Office automation significantly enhances human productivity by automatically finishing routine tasks in the workflow. Beyond the basic information extraction studied in much of the prior document AI literature, the office automation research should be extended to more realistic office tasks which require to integrate various information sources in the office system and produce outputs through a series of decision-making processes. We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic office workflows. OfficeBench requires LLM agents to perform feasible long-horizon planning, proficiently switch between applications in a timely manner, and accurately ground their actions within a large combined action space, based on the contextual demands of the workflow. Applying our customized evaluation methods on each task, we find that GPT-4 Omni achieves the highest pass rate of 47.00%, demonstrating a decent performance in handling office tasks. However, this is still far below the human performance and accuracy standards required by real-world office workflows. We further observe that most issues are related to operation redundancy and hallucinations, as well as limitations in switching between multiple applications, which may provide valuable insights for developing effective agent frameworks for office automation.

7/30/2024

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.

7/23/2024