DriveLM: Driving with Graph Visual Question Answering

Read original: arXiv:2312.14150 - Published 7/18/2024 by Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li

Overview

Researchers explore how vision-language models (VLMs) trained on web data can be integrated into end-to-end autonomous driving systems to improve generalization and enable interaction with human users.
While previous approaches have adapted VLMs for single-round visual question answering (VQA) in driving scenarios, this paper proposes a new "Graph VQA" task that aims to mimic the multi-step reasoning process of human drivers.
The authors create a new dataset (DriveLM-Data) and propose a VLM-based baseline model (DriveLM-Agent) that jointly performs Graph VQA and end-to-end driving.
Experiments show Graph VQA provides a principled framework for reasoning about driving scenes, and DriveLM-Agent performs competitively compared to specialized driving architectures, especially when evaluated on unseen objects or sensor configurations.

Plain English Explanation

The researchers in this paper are exploring how vision-language models (VLMs) - AI systems that can understand both visual and textual information - can be used to improve autonomous driving systems. VLMs are typically trained on huge amounts of web data, so the idea is that integrating them into driving systems could help those systems generalize better and even interact more naturally with human users.

Previous work has tried using VLMs for a single-step visual question answering (VQA) task in driving scenarios. But the researchers here realized that human drivers don't just answer one-off questions - they reason about driving decisions in multiple steps. First they identify key objects, then estimate how those objects might interact, and finally decide on an action.

So the researchers proposed a new "Graph VQA" task that tries to mimic this multi-step human reasoning process. They also created a new dataset (DriveLM-Data) built on the nuScenes real-world driving dataset and the CARLA simulator, and developed a VLM-based baseline model (DriveLM-Agent) that can do both Graph VQA and end-to-end autonomous driving.

The experiments show that the Graph VQA task provides a solid framework for reasoning about driving scenes. And the DriveLM-Agent model, while not the absolute best at pure driving, performs quite well compared to specialized driving architectures - especially when tested on situations it hasn't seen before, like unfamiliar objects or sensor setups.

Overall, this work takes an important step toward using general-purpose vision-language models to create more robust and adaptable autonomous driving systems. The researchers hope it will inspire further work in this direction.

Technical Explanation

The key technical contribution of this paper is the introduction of a "Graph VQA" task that aims to better capture the multi-step human reasoning process involved in driving decisions.

Typical VQA approaches in driving adapt VLMs to answer single-round questions about a scene. But the authors argue that human drivers don't just answer isolated questions - they engage in a more structured process of localizing key objects, estimating their interactions, and planning actions.

To model this, the Graph VQA task links question-answer pairs into a graph, so the model must reason about the relationships and dynamics between different elements of the driving scene across perception, prediction, and planning. The authors create a new dataset, DriveLM-Data, that provides these structured questions, built upon the nuScenes real-world driving dataset and the CARLA simulator.
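As a rough illustration of what graph-structured question-answer pairs could look like in code, here is a minimal sketch. The QANode schema, the field names, and the example questions are assumptions made for illustration; they are not the DriveLM-Data format. The sketch only shows that each QA pair can name the pairs it depends on, and that a topological sort then yields an order running from perception through prediction to planning.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class QANode:
    """One question-answer pair in the reasoning graph (hypothetical schema)."""
    node_id: str
    stage: str  # "perception", "prediction", or "planning"
    question: str
    answer: str = ""
    # Ids of the QA pairs this one logically depends on.
    parents: list[str] = field(default_factory=list)


def reasoning_order(nodes: list[QANode]) -> list[str]:
    """Return node ids so that every QA pair comes after the pairs it depends on."""
    graph = {n.node_id: set(n.parents) for n in nodes}
    return list(TopologicalSorter(graph).static_order())


# Illustrative scene; the questions are made up, not taken from DriveLM-Data.
scene = [
    QANode("plan", "planning", "What should the ego vehicle do?", parents=["pred"]),
    QANode("percep", "perception", "What are the key objects in the scene?"),
    QANode("pred", "prediction", "Will the pedestrian cross the road?", parents=["percep"]),
]

order = reasoning_order(scene)
print(order)
```

Storing dependencies on the child node keeps each QA pair self-describing, and the sort raises graphlib.CycleError if the dependencies ever form a loop.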

The paper also proposes a VLM-based baseline model called DriveLM-Agent that is trained to jointly perform Graph VQA and end-to-end autonomous driving. Experiments show this model can reason about driving scenes in a principled way through the Graph VQA task, and its driving performance is competitive with specialized architectures, especially when evaluated on unseen objects or sensor configurations.
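One plausible way to run a model over such a graph is to answer the questions in dependency order and pass earlier question-answer pairs forward as context. The sketch below assumes that scheme; this summary does not describe the actual DriveLM-Agent prompting, and the names QA_GRAPH, build_prompt, run_graph, and stub_vlm are hypothetical. A stub function stands in for the VLM so the example runs without any model.

```python
from typing import Callable

# Each entry: (node id, question, ids of parent QA pairs), listed in dependency order.
# Hypothetical questions for illustration only.
QA_GRAPH = [
    ("percep", "What are the key objects in the scene?", []),
    ("pred", "Will the pedestrian cross the road?", ["percep"]),
    ("plan", "What should the ego vehicle do?", ["percep", "pred"]),
]


def build_prompt(question, parent_ids, answers, questions):
    """Prepend the parents' questions and answers as context for the current question."""
    context = [f"Q: {questions[p]} A: {answers[p]}" for p in parent_ids]
    return "\n".join(context + [f"Q: {question} A:"])


def run_graph(answer_fn: Callable[[str], str]) -> dict[str, str]:
    """Answer every question in order, feeding earlier answers forward."""
    questions = {nid: q for nid, q, _ in QA_GRAPH}
    answers: dict[str, str] = {}
    for nid, question, parents in QA_GRAPH:
        prompt = build_prompt(question, parents, answers, questions)
        answers[nid] = answer_fn(prompt)
    return answers


def stub_vlm(prompt: str) -> str:
    """Stand-in for a real VLM: reports how many context lines it was given."""
    n_context = prompt.count("\n")
    return f"answer given {n_context} context line(s)"


answers = run_graph(stub_vlm)
for node_id, answer in answers.items():
    print(node_id, "->", answer)
```

In a real system the stub would be replaced by a call to a vision-language model that also receives the camera images, and the final planning answer would be converted into a driving action.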

Overall, the key insight is that Graph VQA serves as a suitable proxy task for the human reasoning process: a VLM trained on its structured perception, prediction, and planning questions is better matched to driving than one adapted through single-round VQA alone. This provides a promising direction for applying general-purpose vision-language models to complex real-world control problems.

Critical Analysis

The authors make a compelling case for the benefits of their proposed Graph VQA task and DriveLM-Agent model. By aiming to mimic the multi-step human reasoning process, Graph VQA seems like a more suitable proxy for driving than standard VQA approaches.

That said, the paper does not provide a detailed analysis of the specific reasoning capabilities learned by DriveLM-Agent through Graph VQA. It would be interesting to see more qualitative examples or probing experiments to understand the model's internal representations and decision-making.

The authors also acknowledge that their DriveLM-Agent baseline, while performing well, is not the absolute state-of-the-art in end-to-end driving. There may be opportunities to further optimize the model architecture and training process to unlock even stronger driving performance.

Additionally, the DriveLM-Data dataset, while an important contribution, draws on only two sources: real-world recordings from nuScenes and simulated scenes from CARLA. Scaling these ideas to the full range of real-world driving scenarios, with all their complexities and uncertainties, remains an open challenge.

Overall, this paper takes an important step forward in exploring how general-purpose vision-language models can be leveraged for autonomous driving. But there is still much work to be done to fully realize the potential of this approach and ensure the safety and robustness of the resulting systems.

Conclusion

This paper explores a novel approach to integrating vision-language models (VLMs) into autonomous driving systems. By proposing a "Graph VQA" task that mimics the multi-step reasoning process of human drivers, the researchers have developed a principled framework for training VLMs to understand and reason about driving scenes.

The authors' DriveLM-Agent baseline model demonstrates the potential benefits of this approach, performing end-to-end driving competitively with driving-specific architectures while also handling the structured Graph VQA task. Its benefits are most pronounced in zero-shot evaluation on unseen objects or sensor configurations.

While there are still many open challenges to address, this work represents an important step toward using general-purpose vision-language models to create more adaptable, generalizable, and interactive autonomous driving systems. The authors have made their code, data, and models publicly available to support further research in this direction.

Related Papers

DriveLM: Driving with Graph Visual Question Answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li

We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps. Starting from the localization of key objects, humans estimate object interactions before taking actions. The key insight is that with our proposed task, Graph VQA, where we model graph-structured reasoning through perception, prediction and planning question-answer pairs, we obtain a suitable proxy task to mimic the human reasoning process. We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent baseline performs end-to-end autonomous driving competitively in comparison to state-of-the-art driving-specific architectures. Notably, its benefits are pronounced when it is evaluated zero-shot on unseen objects or sensor configurations. We hope this work can be the starting point to shed new light on how to apply VLMs for autonomous driving. To facilitate future research, all code, data, and models are available to the public.

7/18/2024

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

6/26/2024

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu

Many fields could benefit from the rapid development of large language models (LLMs). End-to-end autonomous driving (e2eAD) is one of the fields facing new opportunities as LLMs have come to support more and more modalities. Here, by utilizing a vision-language model (VLM), we propose an e2eAD method called SimpleLLM4AD. In our method, the e2eAD task is divided into four stages: perception, prediction, planning, and behavior. Each stage consists of several visual question answering (VQA) pairs, and the VQA pairs interconnect with each other to construct a graph called Graph VQA (GVQA). By reasoning over each VQA pair in the GVQA through the VLM stage by stage, our method can achieve e2e driving with language. In our method, vision transformer (ViT) models are employed to process nuScenes visual data, while the VLM is utilized to interpret and reason about the information extracted from the visual inputs. In the perception stage, the system identifies and classifies objects from the driving environment. The prediction stage involves forecasting the potential movements of these objects. The planning stage utilizes the gathered information to develop a driving strategy, ensuring the safety and efficiency of the autonomous vehicle. Finally, the behavior stage translates the planned actions into executable commands for the vehicle. Our experiments demonstrate that SimpleLLM4AD achieves competitive performance in complex driving scenarios.

8/1/2024

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang

Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. The code and dataset will be released.

7/23/2024