DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

arXiv:2402.12289

Published 6/26/2024 by Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

cs.CV

Abstract

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

Overview

Autonomous driving in urban environments faces a key challenge: understanding complex and unpredictable scenarios, such as difficult road conditions and delicate human behaviors.
The paper introduces DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) to enhance scene understanding and planning capabilities.
DriveVLM integrates reasoning modules for scene description, scene analysis, and hierarchical planning.
Recognizing VLM limitations in spatial reasoning and computational requirements, the authors propose DriveVLM-Dual, a hybrid system that combines the strengths of DriveVLM with traditional autonomous driving pipelines.
Experiments on the nuScenes dataset and the authors' own SUP-AD dataset demonstrate the effectiveness of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions.
The DriveVLM-Dual system is deployed on a production vehicle, verifying its real-world effectiveness in autonomous driving environments.

Plain English Explanation

Autonomous driving systems often struggle to understand the full complexity of urban environments. Complicated road conditions, such as construction or poor weather, and the unpredictable behaviors of pedestrians and other drivers can be difficult for these systems to navigate.

To address this challenge, the researchers developed DriveVLM, a new autonomous driving system that uses Vision-Language Models (VLMs) to better comprehend the driving scene. VLMs are AI models that can understand and describe visual information using natural language.

DriveVLM integrates several key components:

Scene Description: The system can describe the driving environment using natural language, providing a more nuanced understanding.
Scene Analysis: DriveVLM analyzes the scene to identify important features and potential hazards.
Hierarchical Planning: The system plans in stages, moving from high-level driving decisions down to the concrete trajectory the vehicle should follow.

However, VLMs also have some limitations, such as difficulties with spatial reasoning and high computational requirements. To address these issues, the researchers developed a hybrid system called DriveVLM-Dual. This combines the strengths of DriveVLM with more traditional autonomous driving techniques.

The researchers tested both DriveVLM and DriveVLM-Dual on challenging driving datasets, including their own SUP-AD dataset. The results showed that these systems can effectively navigate complex and unpredictable driving conditions. Furthermore, they deployed the DriveVLM-Dual system on a production vehicle, demonstrating its real-world effectiveness.

Technical Explanation

The DriveVLM system integrates Vision-Language Models (VLMs) to enhance the scene understanding and planning capabilities of autonomous driving. VLMs are a class of AI models that jointly process visual and textual information, enabling them to describe images and videos in natural language.

DriveVLM's architecture includes several key components:

Scene Description: A VLM-based module that can generate natural language descriptions of the driving environment, providing a more nuanced understanding compared to traditional perception systems.
Scene Analysis: A reasoning module that analyzes the scene to identify important features, potential hazards, and other relevant information to inform the planning process.
Hierarchical Planning: A multi-stage planning module that works from high-level driving decisions down to a concrete trajectory for the vehicle.
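
The three modules above can be read as a chain in which each stage consumes the previous stage's output. The following is a minimal sketch of that chaining only, not the authors' implementation: the `VLM` callable, the prompt prefixes, `DrivingPlan`, `run_pipeline`, and `toy_vlm` are all hypothetical names introduced for illustration, and a real system would pass camera images rather than a text summary.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a vision-language model: prompt text in, text out.
# A real VLM would also receive camera frames; that part is omitted here.
VLM = Callable[[str], str]


@dataclass
class DrivingPlan:
    scene_description: str
    scene_analysis: str
    plan: str


def run_pipeline(vlm: VLM, camera_summary: str) -> DrivingPlan:
    """Chain the three reasoning stages, feeding each output to the next."""
    description = vlm(f"DESCRIBE: {camera_summary}")
    analysis = vlm(f"ANALYZE: {description}")
    plan = vlm(f"PLAN: {analysis}")
    return DrivingPlan(description, analysis, plan)


def toy_vlm(prompt: str) -> str:
    """Toy model (assumption): echoes which stage it was asked to perform."""
    stage = prompt.split(":", 1)[0]
    return f"<{stage} output>"


if __name__ == "__main__":
    result = run_pipeline(toy_vlm, "pedestrian waiting near a crosswalk")
    print(result)
```

Keeping each stage behind the same callable makes the intermediate text inspectable, which is one reason a language-based pipeline can be easier to audit than a single opaque planner.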

To address the limitations of VLMs, such as their difficulties with spatial reasoning and high computational requirements, the authors propose DriveVLM-Dual. This hybrid system combines the strengths of DriveVLM with the traditional autonomous driving pipeline.
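
One plausible way to arrange such a hybrid is a slow planner that runs occasionally and a fast planner that runs every control tick and refines the most recent slow output. The sketch below illustrates that scheduling idea under stated assumptions; it is not the paper's code. `slow_vlm_plan`, `fast_refine`, `run_dual`, the 5 m waypoint spacing, and the 0.5 m lateral nudge are all invented for illustration.

```python
from typing import List, Optional, Tuple

Waypoint = Tuple[float, float]  # (forward metres, lateral metres)


def slow_vlm_plan(tick: int) -> List[Waypoint]:
    """Hypothetical slow planner: coarse waypoints straight ahead, 5 m apart."""
    return [(5.0 * (i + 1), 0.0) for i in range(3)]


def fast_refine(coarse: List[Waypoint], obstacle_y: Optional[float]) -> List[Waypoint]:
    """Hypothetical fast planner: shift laterally away from a nearby obstacle."""
    if obstacle_y is None:
        return list(coarse)
    shift = -0.5 if obstacle_y >= 0 else 0.5
    return [(x, y + shift) for x, y in coarse]


def run_dual(ticks: int, vlm_period: int,
             obstacles: List[Optional[float]]) -> List[List[Waypoint]]:
    """Run the fast planner every tick; refresh the coarse plan every vlm_period ticks."""
    coarse: List[Waypoint] = []
    outputs: List[List[Waypoint]] = []
    for t in range(ticks):
        if t % vlm_period == 0:
            coarse = slow_vlm_plan(t)  # expensive call, made infrequently
        outputs.append(fast_refine(coarse, obstacles[t]))
    return outputs


if __name__ == "__main__":
    for plan in run_dual(4, 2, [None, 1.0, None, -1.0]):
        print(plan)
```

The design choice being illustrated is that the expensive model never sits on the control loop's critical path: the fast planner always has a coarse plan to work from, even when that plan is a few ticks old.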

The researchers evaluate the performance of DriveVLM and DriveVLM-Dual on the nuScenes dataset and their own SUP-AD dataset, which contains challenging urban driving scenarios. The results demonstrate the systems' capabilities in handling complex and unpredictable driving conditions.
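
This summary does not say which metrics were reported. Planned trajectories are often scored against the human-driven trajectory by their average L2 (Euclidean) displacement, and the sketch below shows that calculation as an assumption about how such an evaluation could look. `average_l2_error` is an illustrative name, not a function from the paper or from the nuScenes toolkit.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_l2_error(predicted: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between matching predicted and reference waypoints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


if __name__ == "__main__":
    planned = [(0.0, 0.0), (3.0, 4.0)]
    driven = [(0.0, 0.0), (0.0, 0.0)]
    print(average_l2_error(planned, driven))  # (0 + 5) / 2 = 2.5
```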

Furthermore, the authors deploy the DriveVLM-Dual system on a production vehicle, verifying its effectiveness in real-world autonomous driving environments. This integration of VLMs with traditional autonomous driving pipelines represents a promising approach to enhance the robustness and performance of self-driving cars.

Critical Analysis

The DriveVLM and DriveVLM-Dual systems represent a valuable contribution to the field of autonomous driving, particularly in addressing the challenge of understanding complex and long-tail scenarios in urban environments. The integration of VLMs to enhance scene understanding and planning is an innovative approach, and the authors' recognition of VLM limitations and the development of a hybrid system is commendable.

However, the paper does not delve deeply into the technical details of the system's implementation, nor does it provide a comprehensive evaluation of its performance in a wider range of driving scenarios. The authors' own SUP-AD dataset, while helpful for testing, may not fully capture the diversity of urban driving conditions encountered in the real world.

Additionally, the paper does not address potential safety concerns or ethical considerations related to the deployment of such autonomous driving systems in public spaces. As these systems become more advanced and integrated into our transportation infrastructure, it will be crucial to ensure they adhere to rigorous safety standards and ethical principles.

Further research is needed to explore the long-term implications of VLM-based autonomous driving, including its impact on transportation infrastructure, urban planning, and the wider social and economic landscape. The development of robust, efficient, and interpretable vision-language models will also be crucial for the widespread adoption of, and trust in, these technologies.

Conclusion

The DriveVLM and DriveVLM-Dual systems introduced in this paper represent a significant advancement in autonomous driving, leveraging VLMs to enhance scene understanding and planning capabilities. By integrating scene description, analysis, and hierarchical planning, these systems demonstrate the potential of VLMs to navigate complex and unpredictable urban driving environments.

While the authors have made important strides, there remain opportunities for further research and development to address the limitations of VLMs, ensure robust performance, and consider the broader societal implications of VLM-based autonomous driving. As these technologies continue to evolve, it will be critical to prioritize safety, ethics, and the responsible integration of autonomous vehicles into our transportation ecosystem.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Vision Language Models in Autonomous Driving: A Survey and Outlook

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, Alois C. Knoll

The applications of Vision-Language Models (VLMs) in the field of Autonomous Driving (AD) have attracted widespread attention due to their outstanding performance and the ability to leverage Large Language Models (LLMs). By incorporating language data, driving systems can gain a better understanding of real-world environments, thereby enhancing driving safety and efficiency. In this work, we present a comprehensive and systematic survey of the advances in vision language models in this domain, encompassing perception and understanding, navigation and planning, decision-making and control, end-to-end autonomous driving, and data generation. We introduce the mainstream VLM tasks in AD and the commonly utilized metrics. Additionally, we review current studies and applications in various areas and summarize the existing language-enhanced autonomous driving datasets thoroughly. Lastly, we discuss the benefits and challenges of VLMs in AD and provide researchers with the current research gaps and future trends.

6/26/2024

cs.CV cs.AI

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, Dit-Yan Yeung, Huchuan Lu, Xu Jia

Large Vision-Language Models (LVLMs) have received widespread attention in advancing the interpretable self-driving. Existing evaluations of LVLMs primarily focus on the multi-faceted capabilities in natural circumstances, lacking automated and quantifiable assessment for self-driving, let alone the severe road corner cases. In this paper, we propose CODA-LM, the very first benchmark for the automatic evaluation of LVLMs for self-driving corner cases. We adopt a hierarchical data structure to prompt powerful LVLMs to analyze complex driving scenes and generate high-quality pre-annotation for human annotators, and for LVLM evaluation, we show that using the text-only large language models (LLMs) as judges reveals even better alignment with human preferences than the LVLM judges. Moreover, with CODA-LM, we build CODA-VLM, a new driving LVLM surpassing all the open-sourced counterparts on CODA-LM. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task. We hope CODA-LM can become the catalyst to promote interpretable self-driving empowered by LVLMs.

6/28/2024

cs.CV

CarLLaVA: Vision language models for camera-only closed-loop driving

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hunermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski

In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0 outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.

6/17/2024

cs.CV cs.RO

Co-driver: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

Ziang Guo, Artem Lykov, Zakhar Yagudin, Mikhail Konenkov, Dzmitry Tsetserukou

Recent research about Large Language Model based autonomous driving solutions shows a promising picture in planning and control fields. However, heavy computational resources and hallucinations of Large Language Models continue to hinder the tasks of predicting precise trajectories and instructing control signals. To address this problem, we propose Co-driver, a novel autonomous driving assistant system to empower autonomous vehicles with adjustable driving behaviors based on the understanding of road scenes. A pipeline involving the CARLA simulator and Robot Operating System 2 (ROS2) verifying the effectiveness of our system is presented, utilizing a single Nvidia 4090 24G GPU while exploiting the capacity of textual output of the Visual Language Model. Besides, we also contribute a dataset containing an image set and a corresponding prompt set for fine-tuning the Visual Language Model module of our system. In the real-world driving dataset, our system achieved 96.16% success rate in night scenes and 89.7% in gloomy scenes regarding reasonable predictions. Our Co-driver dataset will be released at https://github.com/ZionGo6/Co-driver.

5/10/2024

cs.RO