PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning

2406.01587

Published 6/5/2024 by Yupeng Zheng, Zebin Xing, Qichao Zhang, Bu Jin, Pengfei Li, Yuhang Zheng, Zhongpu Xia, Kun Zhan, Xianpeng Lang, Yaran Chen and 1 other

cs.RO

Abstract

Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhile, learning-based methods have yet to achieve superior performance over rule-based approaches in large-scale closed-loop scenarios. To address these issues, we propose PlanAgent, the first mid-to-mid planning system based on a Multi-modal Large Language Model (MLLM). The MLLM is used as a cognitive agent to introduce human-like knowledge, interpretability, and common-sense reasoning into closed-loop planning. Specifically, PlanAgent leverages the power of the MLLM through three core modules. First, an Environment Transformation module constructs a Bird's Eye View (BEV) map and a lane-graph-based textual description from the environment as inputs. Second, a Reasoning Engine module introduces a hierarchical chain-of-thought from scene understanding to lateral and longitudinal motion instructions, culminating in planner code generation. Last, a Reflection module is integrated to simulate and evaluate the generated planner to reduce the MLLM's uncertainty. PlanAgent is endowed with the common-sense reasoning and generalization capability of the MLLM, which empowers it to effectively tackle both common and complex long-tailed scenarios. Our proposed PlanAgent is evaluated on the large-scale and challenging nuPlan benchmarks. A comprehensive set of experiments convincingly demonstrates that PlanAgent outperforms the existing state of the art in the closed-loop motion planning task. Code will be released soon.

Overview

Introduces PlanAgent, a multi-modal large language agent for closed-loop vehicle motion planning
Uses a Multi-modal Large Language Model (MLLM) as a cognitive agent through three modules: Environment Transformation, Reasoning Engine, and Reflection
Reports better closed-loop planning results than existing state-of-the-art methods on the nuPlan benchmarks

Plain English Explanation

This paper presents PlanAgent, an AI system for vehicle motion planning in autonomous driving. According to the abstract, rule-based planners work well in common scenarios but generalize poorly to rare ones, while learning-based planners have not yet beaten rule-based ones in large-scale closed-loop tests. PlanAgent addresses this by using a multi-modal large language model (MLLM) to bring human-like knowledge, interpretability, and common-sense reasoning into planning.

The key idea is to give the MLLM two views of the driving environment: a Bird's Eye View (BEV) map and a textual description derived from the lane graph. The model then reasons step by step, from understanding the scene to deciding how the vehicle should move sideways (lateral) and along the road (longitudinal), and finally generates planner code. A Reflection step simulates and evaluates that planner before it is used, which the authors describe as a way to reduce the MLLM's uncertainty.

The authors report that PlanAgent outperforms existing state-of-the-art methods on the nuPlan closed-loop benchmarks, and that it handles both common scenarios and complex long-tailed ones. This suggests that MLLM-based agents like PlanAgent could be an important step towards making self-driving cars more capable and reliable.

Technical Explanation

PlanAgent is described as a mid-to-mid planning system built on a Multi-modal Large Language Model (MLLM), which acts as a cognitive agent and contributes common-sense knowledge. Its Environment Transformation module converts the driving environment into the two inputs the MLLM receives: a Bird's Eye View (BEV) map and a lane-graph-based textual description.
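The abstract mentions a lane-graph-based textual description as one of the MLLM's inputs but does not give its format. The following is a minimal sketch of how such a description could be produced, assuming a hypothetical `Lane` and `Agent` representation; none of these names or fields come from the paper, and the BEV map input is not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Lane:
    """Hypothetical lane-graph node; field names are illustrative only."""
    lane_id: str
    successors: List[str] = field(default_factory=list)
    left: Optional[str] = None
    right: Optional[str] = None
    speed_limit_mps: Optional[float] = None


@dataclass
class Agent:
    """Hypothetical tracked agent, located by lane and distance ahead of ego."""
    kind: str
    lane_id: str
    distance_ahead_m: float
    speed_mps: float


def describe_scene(lanes: Dict[str, Lane], ego_lane: str, agents: List[Agent]) -> str:
    """Render the lane graph around the ego vehicle as plain text."""
    lane = lanes[ego_lane]
    lines = [f"Ego vehicle is on lane {lane.lane_id}."]
    if lane.speed_limit_mps is not None:
        lines.append(f"Speed limit: {lane.speed_limit_mps:.1f} m/s.")
    lines.append("Left neighbor lane: " + (lane.left or "none") + ".")
    lines.append("Right neighbor lane: " + (lane.right or "none") + ".")
    lines.append("Successor lanes: " + (", ".join(lane.successors) or "none") + ".")
    # List agents from nearest to farthest so the closest hazards come first.
    for a in sorted(agents, key=lambda a: a.distance_ahead_m):
        lines.append(
            f"{a.kind} on lane {a.lane_id}, {a.distance_ahead_m:.0f} m ahead, "
            f"moving at {a.speed_mps:.1f} m/s."
        )
    return "\n".join(lines)


if __name__ == "__main__":
    demo_lanes = {
        "L1": Lane("L1", successors=["L3"], left="L2", speed_limit_mps=13.4),
        "L2": Lane("L2", right="L1"),
    }
    demo_agents = [Agent("vehicle", "L1", 25.0, 8.0)]
    print(describe_scene(demo_lanes, "L1", demo_agents))
```

Keeping the description as short declarative sentences is one possible design choice: it makes the text easy to splice into a prompt and easy to inspect by hand.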

The Reasoning Engine module applies a hierarchical chain-of-thought to these inputs. The MLLM first forms an understanding of the scene, then derives lateral and longitudinal motion instructions, and finally generates planner code that carries out those instructions.
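The abstract describes the Reasoning Engine as a hierarchical chain-of-thought that moves from scene understanding to lateral and longitudinal instructions and then to planner code. A minimal sketch of that staging follows, assuming a hypothetical `ask` callable that stands in for the MLLM. The stage wording and prompt layout are invented for illustration, and the sketch is text-only, so the BEV map input is left out.

```python
from typing import Callable, Dict, List, Tuple

# Stage names follow the order given in the abstract; the questions are assumptions.
STAGES: List[Tuple[str, str]] = [
    ("scene_understanding", "Describe the key agents, lanes, and risks in the scene."),
    ("lateral_instruction", "Choose a lateral action: keep lane, change left, or change right."),
    ("longitudinal_instruction", "Choose a longitudinal action: accelerate, keep speed, decelerate, or stop."),
    ("planner_code", "Write planner parameters that carry out the instructions above."),
]


def run_chain(scene_text: str, ask: Callable[[str], str]) -> Dict[str, str]:
    """Query the model once per stage, feeding earlier answers into later prompts."""
    answers: Dict[str, str] = {}
    for name, question in STAGES:
        context = "\n".join(f"[{k}] {v}" for k, v in answers.items())
        prompt = (
            f"Scene:\n{scene_text}\n\n"
            f"Previous steps:\n{context or '(none)'}\n\n"
            f"Task: {question}"
        )
        answers[name] = ask(prompt)
    return answers


if __name__ == "__main__":
    # A stub model that only echoes the task line, so the example runs offline.
    def echo_model(prompt: str) -> str:
        return "stub answer for: " + prompt.splitlines()[-1]

    for stage, answer in run_chain("Ego vehicle is on lane L1.", echo_model).items():
        print(stage, "->", answer)
```

Because each stage sees the answers of the stages before it, the final planner output is conditioned on an explicit scene summary and explicit motion instructions, which is what makes the intermediate reasoning inspectable.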

The Reflection module simulates and evaluates the generated planner before it is used, which the authors describe as a way to reduce the MLLM's uncertainty. The abstract does not mention reinforcement learning or end-to-end policy training. PlanAgent is evaluated in closed loop on the nuPlan benchmarks, where the authors report that it outperforms the existing state of the art.
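The abstract says a Reflection module simulates and evaluates the generated planner but does not say how. The sketch below shows one way such a check could work, assuming a one-dimensional car-following rollout and a hypothetical `PlannerParams` structure. The controller rule, the thresholds, and the `propose` callable (a stand-in for the MLLM) are all assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class PlannerParams:
    """Hypothetical planner parameters that a model might emit."""
    target_speed: float      # m/s
    min_gap: float           # m, standstill gap to the lead vehicle
    max_accel: float = 1.5   # m/s^2
    max_decel: float = 3.0   # m/s^2


def simulate(params: PlannerParams, ego_speed: float, lead_gap: float,
             lead_speed: float, horizon_s: float = 8.0, dt: float = 0.1) -> Tuple[bool, float]:
    """Roll the planner forward behind a constant-speed lead vehicle.

    Returns (collided, closest_gap_m).
    """
    gap, v = lead_gap, ego_speed
    closest = gap
    for _ in range(int(horizon_s / dt)):
        desired_gap = params.min_gap + 1.5 * v  # standstill gap plus 1.5 s headway
        if gap < desired_gap:
            a = -params.max_decel               # too close: brake
        elif v < params.target_speed:
            a = params.max_accel                # room ahead: speed up
        else:
            a = 0.0
        v = max(0.0, v + a * dt)
        gap += (lead_speed - v) * dt
        closest = min(closest, gap)
        if gap <= 0.0:
            return True, closest
    return False, closest


def reflect(propose: Callable[[Optional[str]], PlannerParams], scenario: Dict[str, float],
            max_rounds: int = 3, safe_gap: float = 2.0) -> Tuple[Optional[PlannerParams], int]:
    """Ask for a planner, simulate it, and re-ask with feedback if it is unsafe."""
    feedback: Optional[str] = None
    for round_idx in range(max_rounds):
        params = propose(feedback)
        collided, closest = simulate(params, **scenario)
        if not collided and closest >= safe_gap:
            return params, round_idx + 1
        feedback = (f"Rejected: collided={collided}, closest gap={closest:.1f} m. "
                    "Propose a more conservative planner.")
    return None, max_rounds
```

A caller would pass a scenario such as `{"ego_speed": 10.0, "lead_gap": 30.0, "lead_speed": 0.0}` and a `propose` function that wraps the model. Returning `None` after `max_rounds` leaves the fallback behavior to the caller.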

Critical Analysis

Several limitations and open questions remain. PlanAgent is evaluated in closed-loop simulation on the nuPlan benchmarks, so its performance on real-world roads has not yet been demonstrated. There are also open questions about the system's robustness to sensor failures, adversarial attacks, and unexpected edge cases.

Additionally, the abstract does not say which MLLM is used, how the Reflection module scores a generated planner, or what inference cost the approach incurs. More transparency around these details would allow for a deeper understanding of the system's strengths and weaknesses.

That said, the results presented in the paper are promising and suggest that integrated multi-modal AI systems could be a fruitful direction for autonomous driving research. Continued progress in this area could lead to self-driving cars that are more responsive, adaptable, and reliable in complex real-world environments.

Conclusion

The PlanAgent system demonstrates how a multi-modal large language model can serve as a cognitive agent for closed-loop motion planning. By combining a BEV map and lane-graph text as inputs, hierarchical chain-of-thought reasoning, and a Reflection step that simulates each generated planner, the system targets both common scenarios and complex long-tailed ones.

While there are still important challenges to address, the performance improvements shown in this paper suggest that multi-modal AI agents like PlanAgent could be a key step towards making self-driving cars a reality. As the technology continues to evolve, we may see autonomous vehicles that can better understand and respond to the nuances of human driving behavior and natural language guidance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Planning with Adaptive World Models for Autonomous Driving

Arun Balajee Vasudevan, Neehar Peri, Jeff Schneider, Deva Ramanan

Motion planning is crucial for safe navigation in complex urban environments. Historically, motion planners (MPs) have been evaluated with procedurally-generated simulators like CARLA. However, such synthetic benchmarks do not capture real-world multi-agent interactions. nuPlan, a recently released MP benchmark, addresses this limitation by augmenting real-world driving logs with closed-loop simulation logic, effectively turning the fixed dataset into a reactive simulator. We analyze the characteristics of nuPlan's recorded logs and find that each city has its own unique driving behaviors, suggesting that robust planners must adapt to different environments. We learn to model such unique behaviors with BehaviorNet, a graph convolutional neural network (GCNN) that predicts reactive agent behaviors using features derived from recently-observed agent histories; intuitively, some aggressive agents may tailgate lead vehicles, while others may not. To model such phenomena, BehaviorNet predicts parameters of an agent's motion controller rather than predicting its spacetime trajectory (as most forecasters do). Finally, we present AdaptiveDriver, a model-predictive control (MPC) based planner that unrolls different world models conditioned on BehaviorNet's predictions. Our extensive experiments demonstrate that AdaptiveDriver achieves state-of-the-art results on the nuPlan closed-loop planning benchmark, reducing test error from 6.4% to 4.6%, even when applied to never-before-seen cities.

6/18/2024

cs.RO cs.LG

Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Yuan Chen, Zi-han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, Si Liu

Despite real-time planners exhibiting remarkable performance in autonomous driving, the growing exploration of Large Language Models (LLMs) has opened avenues for enhancing the interpretability and controllability of motion planning. Nevertheless, LLM-based planners continue to encounter significant challenges, including elevated resource consumption and extended inference times, which pose substantial obstacles to practical deployment. In light of these challenges, we introduce AsyncDriver, a new asynchronous LLM-enhanced closed-loop framework designed to leverage scene-associated instruction features produced by LLM to guide real-time planners in making precise and controllable trajectory predictions. On one hand, our method highlights the prowess of LLMs in comprehending and reasoning with vectorized scene data and a series of routing instructions, demonstrating its effective assistance to real-time planners. On the other hand, the proposed framework decouples the inference processes of the LLM and real-time planners. By capitalizing on the asynchronous nature of their inference frequencies, our approach has successfully reduced the computational cost introduced by LLM, while maintaining comparable performance. Experiments show that our approach achieves superior closed-loop evaluation performance on nuPlan's challenging scenarios.

6/24/2024

cs.RO cs.CV

Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios?

Marcel Hallgarten, Julian Zapata, Martin Stoll, Katrin Renz, Andreas Zell

Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios. Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop). In particular, nuPlan seems to be an expressive evaluation method since it is based on real-world data and closed-loop, yet it mostly covers basic driving scenarios. This makes it difficult to judge a planner's capabilities to generalize to rarely-seen situations. Therefore, we propose a novel closed-loop benchmark interPlan containing several edge cases and challenging driving scenarios. We assess existing state-of-the-art planners on our benchmark and show that neither rule-based nor learning-based planners can safely navigate the interPlan scenarios. A recently evolving direction is the usage of foundation models like large language models (LLM) to handle generalization. We evaluate an LLM-only planner and introduce a novel hybrid planner that combines an LLM-based behavior planner with a rule-based motion planner that achieves state-of-the-art performance on our benchmark.

4/12/2024

cs.RO cs.AI cs.LG

Ask-before-Plan: Proactive Language Agents for Real-World Planning

Xuan Zhang, Yang Deng, Zifeng Ren, See-Kiong Ng, Tat-Seng Chua

The evolution of large language models (LLMs) has enhanced the planning capabilities of language agents in diverse real-world scenarios. Despite these advancements, the potential of LLM-powered agents to comprehend ambiguous user instructions for reasoning and decision-making is still under exploration. In this work, we introduce a new task, Proactive Agent Planning, which requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction, invoke external tools to collect valid information, and generate a plan to fulfill the user's demands. To study this practical problem, we establish a new benchmark dataset, Ask-before-Plan. To tackle the deficiency of LLMs in proactive planning, we propose a novel multi-agent framework, Clarification-Execution-Planning (CEP), which consists of three agents specialized in clarification, execution, and planning. We introduce the trajectory tuning scheme for the clarification agent and static execution agent, as well as the memory recollection mechanism for the dynamic execution agent. Extensive evaluations and comprehensive analyses conducted on the Ask-before-Plan dataset validate the effectiveness of our proposed framework.

6/19/2024

cs.CL cs.AI