NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

2406.04520

Published 6/10/2024 by Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi and 1 other

cs.CL cs.AI

Abstract

We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on Planning. We observe that NATURAL PLAN is a challenging benchmark for state of the art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rate respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long-contexts on improving LLM planning.

Overview

This paper, titled "Natural Plan: Benchmarking LLMs on Natural Language Planning", explores the ability of large language models (LLMs) to engage in natural language planning.
The researchers developed a new benchmark, called "Natural Plan", to assess the planning capabilities of LLMs on three real-world tasks: Trip Planning, Meeting Planning, and Calendar Scheduling.
The paper presents the results of evaluating several state-of-the-art LLMs on the Natural Plan benchmark, providing insights into their planning abilities and limitations.

Plain English Explanation

The paper investigates how well large language models, which are AI systems trained on massive amounts of text data, can perform "planning" tasks. Planning is the process of coming up with a series of steps to achieve a goal, and it's an important skill for AI systems to have.

The researchers created a new benchmark called "Natural Plan" that tests the planning abilities of language models on three real-world scenarios: planning a trip, planning meetings, and scheduling events on a calendar. This shows how capable these language models are at tasks that require planning, reasoning, and decision-making, rather than just generating fluent text.

By testing several state-of-the-art language models on the Natural Plan benchmark, the researchers were able to identify the strengths and weaknesses of these models when it comes to planning. This provides valuable insights for developing more capable planning-aware AI systems in the future.

Technical Explanation

The paper introduces the "Natural Plan" benchmark, which comprises three planning tasks posed in natural language: Trip Planning, Meeting Planning, and Calendar Scheduling. Each task is grounded in a real-world scenario, and the outputs of tools such as Google Flights, Google Maps, and Google Calendar are supplied to the model as context. Because the model has full information on the task, the benchmark isolates planning ability and does not require a tool-use environment.
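To make the idea of checking a plan against task constraints concrete, here is a minimal sketch of a validity check for a trip plan. Everything in it is an assumption made for illustration: the plan format, the `is_valid_trip` helper, and the example cities and flights are hypothetical, and the benchmark's actual scoring may differ.

```python
def is_valid_trip(plan, required_days, direct_flights):
    """Check a hypothetical trip plan against simple constraints.

    plan: ordered list of (city, days) pairs.
    required_days: dict mapping each city to the number of days to spend there.
    direct_flights: set of (origin, destination) pairs with a direct flight.
    """
    visited = [city for city, _ in plan]
    # Every required city must be visited exactly once.
    if sorted(visited) != sorted(required_days):
        return False
    # The stay in each city must match the requested length.
    if any(days != required_days[city] for city, days in plan):
        return False
    # Consecutive cities must be connected by a direct flight.
    return all((a, b) in direct_flights for a, b in zip(visited, visited[1:]))


# Illustrative data only; not taken from the benchmark.
required_days = {"Paris": 3, "Rome": 2, "Oslo": 2}
direct_flights = {("Paris", "Rome"), ("Rome", "Oslo")}

good_plan = [("Paris", 3), ("Rome", 2), ("Oslo", 2)]
bad_plan = [("Paris", 3), ("Oslo", 2), ("Rome", 2)]  # no Paris -> Oslo flight

good_ok = is_valid_trip(good_plan, required_days, direct_flights)
bad_ok = is_valid_trip(bad_plan, required_days, direct_flights)
```

The second plan visits the right cities for the right number of days but fails because its ordering needs a flight that does not exist. This kind of multi-constraint bookkeeping is what becomes harder as the number of cities grows.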

The researchers evaluated state-of-the-art language models, including GPT-4 and Gemini 1.5 Pro, on the Natural Plan benchmark. Performance is reported as solve rate: in Trip Planning, GPT-4 and Gemini 1.5 Pro reached only 31.1% and 34.8% respectively, and all models fell below 5% on problems with 10 cities. The paper also reports ablation studies on self-correction, few-shot generalization, and in-context planning with long contexts, which shed light on how far these approaches help LLM planning.
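The abstract reports results as solve rates and notes that they fall as the number of cities grows. The sketch below shows one way such a breakdown could be computed. The `solve_rate_by_complexity` helper and the input format are assumptions for illustration, and the example outcomes are made up, not data from the paper.

```python
from collections import defaultdict


def solve_rate_by_complexity(results):
    """Group outcomes by problem size and return the solve rate per group.

    results: iterable of (num_cities, solved) pairs, where solved is a bool.
    Returns a dict mapping num_cities to the fraction of problems solved.
    """
    totals = defaultdict(int)
    solved_counts = defaultdict(int)
    for num_cities, solved in results:
        totals[num_cities] += 1
        if solved:
            solved_counts[num_cities] += 1
    return {n: solved_counts[n] / totals[n] for n in sorted(totals)}


# Made-up outcomes for illustration only; not results from the paper.
example_results = [
    (3, True), (3, True), (3, False), (3, True),
    (10, False), (10, False),
]
rates = solve_rate_by_complexity(example_results)
```

Reporting the rate per complexity level, and not only as a single average, is what exposes the drop on larger problems that the abstract describes.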

Critical Analysis

The paper acknowledges that while the Natural Plan benchmark represents a significant step forward in evaluating the planning capabilities of language models, it still has some limitations. For example, the tasks are relatively constrained and may not fully capture the complexity of real-world planning challenges. Additionally, the researchers note that further work is needed to develop more robust planning-aware techniques that can handle a wider range of planning scenarios.

One potential area for improvement is to explore how language models can be better integrated with specialized planning algorithms to leverage their strengths in natural language understanding and generation while also incorporating more structured planning capabilities. This could lead to the development of more powerful and versatile planning-aware AI systems in the future.

Conclusion

This paper presents a significant step forward in the evaluation of large language models' planning abilities through the development of the Natural Plan benchmark. The results provide valuable insights into the current capabilities and limitations of state-of-the-art language models when it comes to natural language planning tasks. While the benchmark has its own limitations, the research paves the way for the continued development of more capable and planning-aware AI systems that can better understand and solve complex, real-world planning problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring and Benchmarking the Planning Capabilities of Large Language Models

Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, Hanie Sedghi

We seek to elevate the planning capabilities of Large Language Models (LLMs), investigating four main directions. First, we construct a comprehensive benchmark suite encompassing both classical planning domains and natural language scenarios. This suite includes algorithms to generate instances with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance. Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures. Finally, we investigate the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges.

6/21/2024

cs.CL cs.AI cs.LG

TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su

Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks: even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.

6/26/2024

cs.CL

What's the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models

Eran Hirsch, Guy Uziel, Ateret Anaby-Tavor

Planning is a fundamental task in artificial intelligence that involves finding a sequence of actions that achieve a specified goal in a given environment. Large language models (LLMs) are increasingly used for applications that require planning capabilities, such as web or embodied agents. In line with recent studies, we demonstrate through experimentation that LLMs lack necessary skills required for planning. Based on these observations, we advocate for the potential of a hybrid approach that combines LLMs with classical planning methodology. Then, we introduce SimPlan, a novel hybrid-method, and evaluate its performance in a new challenging setup. Our extensive experiments across various planning domains demonstrate that SimPlan significantly outperforms existing LLM-based planners.

5/24/2024

cs.CL

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu

The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs. A crucial milestone in the evolution of AGI is the attainment of human-level planning, a fundamental ability for making informed decisions in complex environments, and solving a wide range of real-world problems. Despite the impressive advancements in MLLMs, a question remains: How far are current MLLMs from achieving human-level planning? To shed light on this question, we introduce EgoPlan-Bench, a comprehensive benchmark to evaluate the planning abilities of MLLMs in real-world scenarios from an egocentric perspective, mirroring human perception. EgoPlan-Bench emphasizes the evaluation of planning capabilities of MLLMs, featuring realistic tasks, diverse action plans, and intricate visual observations. Our rigorous evaluation of a wide range of MLLMs reveals that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. To facilitate this advancement, we further present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench. We have made all codes, data, and a maintained benchmark leaderboard available to advance future research.

6/12/2024

cs.CV cs.CL cs.RO