Large Language Models as Planning Domain Generators

2405.06650

Published 5/14/2024 by James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, Shirin Sohrabi

Abstract

Developing domain models is one of the few remaining places that require manual human labor in AI planning. Thus, in order to make planning more accessible, it is desirable to automate the process of domain model generation. To this end, we investigate if large language models (LLMs) can be used to generate planning domain models from simple textual descriptions. Specifically, we introduce a framework for automated evaluation of LLM-generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models across 9 different planning domains, and under three classes of natural language domain descriptions. Our results indicate that LLMs, particularly those with high parameter counts, exhibit a moderate level of proficiency in generating correct planning domains from natural language descriptions. Our code is available at https://github.com/IBM/NL2PDDL.

Overview

The paper investigates whether large language models (LLMs) can be used to generate planning domain models from simple textual descriptions, automating a process that has traditionally required manual human labor.
The researchers introduce a framework for automatically evaluating LLM-generated domains by comparing the sets of plans for domain instances.
The paper presents an empirical analysis of 7 large language models, including coding and chat models, across 9 different planning domains and under three classes of natural language domain descriptions.
The results indicate that LLMs, particularly those with high parameter counts, exhibit a moderate level of proficiency in generating correct planning domains from natural language descriptions.

Plain English Explanation

Planning is a core part of artificial intelligence (AI), yet developing domain models remains one of the few steps that still requires manual human effort. These domain models are the blueprints that AI systems use to plan and make decisions.

The researchers in this paper wanted to see if they could automate this process of creating domain models by using large language models (LLMs). LLMs are AI systems that are trained on massive amounts of text data, allowing them to understand and generate human-like language.

The idea was to see if LLMs could take simple textual descriptions of a planning problem and automatically generate the corresponding domain model that an AI system could use to solve that problem. To evaluate this, the researchers developed a framework that compares the sets of plans that an AI system could generate using the LLM-created domain models versus the original, human-created domain models.

The researchers tested 7 different LLMs, including both coding-focused and chat-focused models, across 9 different planning domains and 3 classes of natural language descriptions. The results showed that the LLMs, especially those with the highest parameter counts, were able to generate domain models that were reasonably accurate, though not perfect.

Technical Explanation

The paper introduces a framework for the automated evaluation of LLM-generated planning domains. The key idea is to compare the sets of plans that can be generated using the LLM-created domain models versus the original, human-created domain models. If the sets of plans are similar, then the LLM-generated domain model is considered accurate.
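One way to picture this comparison is to check whether a plan that works under the reference domain also works under the generated one. The sketch below is a minimal illustration in Python, not the paper's implementation: the `Action` type, the `plan_is_valid` helper, and the toy ground actions (`pickup-a`, `stack-a-b`) are all hypothetical, and it assumes a simplified STRIPS-style model with ground actions only.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set


class Action(NamedTuple):
    """A ground STRIPS-style action (hypothetical, for illustration only)."""
    pre: FrozenSet[str]      # facts that must hold before the action
    add: FrozenSet[str]      # facts made true by the action
    delete: FrozenSet[str]   # facts made false by the action


Domain = Dict[str, Action]


def plan_is_valid(domain: Domain, init: Set[str], goal: Set[str],
                  plan: List[str]) -> bool:
    """Simulate the plan step by step and report whether it reaches the goal."""
    state = set(init)
    for name in plan:
        action = domain.get(name)
        # Unknown action or unmet precondition: the plan fails here.
        if action is None or not action.pre <= state:
            return False
        state = (state - action.delete) | action.add
    return goal <= state


PICKUP_A = Action(
    pre=frozenset({"handempty", "clear-a", "ontable-a"}),
    add=frozenset({"holding-a"}),
    delete=frozenset({"handempty", "clear-a", "ontable-a"}),
)

# Reference (human-written) domain: stacking requires the target to be clear.
REFERENCE: Domain = {
    "pickup-a": PICKUP_A,
    "stack-a-b": Action(
        pre=frozenset({"holding-a", "clear-b"}),
        add=frozenset({"on-a-b", "handempty", "clear-a"}),
        delete=frozenset({"holding-a", "clear-b"}),
    ),
}

# Hypothetical generated domain that forgot the "clear-b" precondition.
GENERATED: Domain = {
    "pickup-a": PICKUP_A,
    "stack-a-b": Action(
        pre=frozenset({"holding-a"}),
        add=frozenset({"on-a-b", "handempty", "clear-a"}),
        delete=frozenset({"holding-a", "clear-b"}),
    ),
}

INIT = {"handempty", "clear-a", "ontable-a", "clear-b", "ontable-b"}
BLOCKED_INIT = (INIT - {"clear-b"}) | {"on-c-b"}  # block c sits on b
GOAL = {"on-a-b"}
PLAN = ["pickup-a", "stack-a-b"]

if __name__ == "__main__":
    for label, init in (("open", INIT), ("blocked", BLOCKED_INIT)):
        print(label,
              plan_is_valid(REFERENCE, init, GOAL, PLAN),
              plan_is_valid(GENERATED, init, GOAL, PLAN))
```

In this toy case both domains accept the plan on the first instance, but only the generated domain accepts it when block `b` is covered. A disagreement like this on any instance means the two domains do not admit the same sets of plans.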

The researchers tested 7 different LLMs, including coding and chat models, across 9 different planning domains (e.g., Logistics, Gripper, Blocks World) and under three classes of natural language domain descriptions. They measured the accuracy of the LLM-generated domain models by calculating the F1 score between the sets of plans.
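The F1 score mentioned above can be illustrated on two small plan sets. The following is a hedged sketch, assuming each plan is a sequence of action names and that two plans match only if they are identical. The function name `plan_set_f1` and the example plans are hypothetical, and the paper's exact metric may differ.

```python
from typing import Iterable, Sequence


def plan_set_f1(reference: Iterable[Sequence[str]],
                generated: Iterable[Sequence[str]]) -> float:
    """F1 overlap between two sets of plans (illustrative assumption only)."""
    ref = {tuple(plan) for plan in reference}
    gen = {tuple(plan) for plan in generated}
    if not ref and not gen:
        return 1.0  # two empty sets agree trivially
    shared = len(ref & gen)
    if shared == 0:
        return 0.0
    precision = shared / len(gen)  # share of generated plans that are correct
    recall = shared / len(ref)     # share of reference plans that are recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical plan sets for a single problem instance.
reference_plans = [
    ("pickup-a", "stack-a-b"),
    ("pickup-b", "stack-b-a"),
    ("pickup-a", "putdown-a"),
]
generated_plans = [
    ("pickup-a", "stack-a-b"),
    ("pickup-b", "stack-b-a"),
    ("pickup-c", "stack-c-a"),
]

if __name__ == "__main__":
    print(round(plan_set_f1(reference_plans, generated_plans), 3))
```

Here two of the three plans on each side are shared, so precision and recall are both 2/3 and the F1 score is 2/3 as well.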

The results showed that the larger LLMs, with higher parameter counts, generally performed better than the smaller models. The best-performing model achieved an F1 score of around 0.6, indicating a moderate level of proficiency in generating accurate planning domain models from natural language descriptions.

Critical Analysis

The paper acknowledges several limitations and areas for further research. First, the evaluation framework relies on the assumption that the original, human-created domain models are "ground truth," which may not always be the case. Generating consistent PDDL domains using large language models may be a more complex task than the researchers suggest.

Additionally, the paper does not explore the potential for small language models to perform reasoning or the extent to which large language models can be considered learnable planners. These are important considerations that could provide a more nuanced understanding of the capabilities and limitations of LLMs in this domain.

Finally, the paper does not address the potential for LLMs to plan users' travels or other real-world applications of this technology. Further research in these areas could help to better understand the practical implications of the findings.

Conclusion

This paper demonstrates that large language models can be used to generate planning domain models from natural language descriptions, though the generated models are not always correct. The researchers have introduced a novel evaluation framework and provided an empirical analysis of several LLMs across multiple planning domains.

While the results are promising, the paper also highlights the need for further research to address the limitations and explore the potential applications of this technology. As AI systems become more sophisticated, the ability to automatically generate planning domain models from natural language descriptions could significantly improve the accessibility and usability of AI planning tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions

Elliot Gestrin, Marco Kuhlmann, Jendrik Seipp

Today's classical planners are powerful, but modeling input tasks in formats such as PDDL is tedious and error-prone. In contrast, planning with Large Language Models (LLMs) allows for almost any input text, but offers no guarantees on plan quality or even soundness. In an attempt to merge the best of these two approaches, some work has begun to use LLMs to automate parts of the PDDL creation process. However, these methods still require various degrees of expert input. We present NL2Plan, the first domain-agnostic offline LLM-driven planning system. NL2Plan uses an LLM to incrementally extract the necessary information from a short text prompt before creating a complete PDDL description of both the domain and the problem, which is finally solved by a classical planner. We evaluate NL2Plan on four planning domains and find that it solves 10 out of 15 tasks - a clear improvement over a plain chain-of-thought reasoning LLM approach, which only solves 2 tasks. Moreover, in two out of the five failure cases, instead of returning an invalid plan, NL2Plan reports that it failed to solve the task. In addition to using NL2Plan in end-to-end mode, users can inspect and correct all of its intermediate results, such as the PDDL representation, increasing explainability and making it an assistive tool for PDDL creation.

5/8/2024

cs.AI

Exploring and Benchmarking the Planning Capabilities of Large Language Models

Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, Hanie Sedghi

We seek to elevate the planning capabilities of Large Language Models (LLMs), investigating four main directions. First, we construct a comprehensive benchmark suite encompassing both classical planning domains and natural language scenarios. This suite includes algorithms to generate instances with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance. Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures. Finally, we investigate the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges.

6/21/2024

cs.CL cs.AI cs.LG

Generating consistent PDDL domains with Large Language Models

Pavel Smirnov, Frank Joublin, Antonello Ceravola, Michael Gienger

Large Language Models (LLMs) are capable of transforming natural language domain descriptions into plausible-looking PDDL markup. However, ensuring that actions are consistent within domains still remains a challenging task. In this paper we present a novel concept to significantly improve the quality of LLM-generated PDDL models by performing automated consistency checking during the generation process. Although the proposed consistency checking strategies still can't guarantee absolute correctness of generated models, they can serve as a valuable source of feedback, reducing the amount of correction effort expected from a human in the loop. We demonstrate the capabilities of our error detection approach on a number of classical and custom planning domains (logistics, gripper, tyreworld, household, pizza).

4/12/2024

cs.RO cs.AI

Language Models can Infer Action Semantics for Classical Planners from Environment Feedback

Wang Zhu, Ishika Singh, Robin Jia, Jesse Thomason

Classical planning approaches guarantee finding a set of actions that can achieve a given goal state when possible, but require an expert to specify logical action semantics that govern the dynamics of the environment. Researchers have shown that Large Language Models (LLMs) can be used to directly infer planning steps based on commonsense knowledge and minimal domain information alone, but such plans often fail on execution. We bring together the strengths of classical planning and LLM commonsense inference to perform domain induction, learning and validating action pre- and post-conditions based on closed-loop interactions with the environment itself. We propose PSALM, which leverages LLM inference to heuristically complete partial plans emitted by a classical planner given partial domain knowledge, as well as to infer the semantic rules of the domain in a logical language based on environment feedback after execution. Our analysis on 7 environments shows that with just one expert-curated example plan, using LLMs as heuristic planners and rule predictors achieves lower environment execution steps and environment resets than random exploration while simultaneously recovering the underlying ground truth action semantics of the domain.

6/6/2024

cs.AI cs.CL cs.RO