CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Read original: arXiv:2408.10845 - Published 8/21/2024 by Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto

Overview

Comprehensive Vision-Language-Action (CoVLA) dataset for autonomous driving
Includes multi-modal data: images, text, and vehicle actions
Aims to enable research on vision-language-action models for self-driving cars

Plain English Explanation

CoVLA (Comprehensive Vision-Language-Action) is a new dataset created to support research on self-driving cars. It combines real-world driving video, text descriptions of each scene, and information about the path the vehicle actually drove.

The key idea behind this dataset is to provide a comprehensive resource for developing vision-language-action models that can help make self-driving cars more capable. These models aim to combine information from visual, textual, and action-based sources to enable autonomous vehicles to better understand and respond to complex driving situations.

With access to this multi-modal data, researchers can train and evaluate more advanced AI systems for autonomous driving than camera images or sensor data alone would allow. The dataset covers more than 80 hours of real-world driving across a variety of scenarios, which is important for developing robust and generalizable self-driving algorithms.

Technical Explanation

According to the paper's abstract, the CoVLA dataset is built from real-world driving videos spanning more than 80 hours, together with raw in-vehicle sensor data. Each sample combines three modalities:

Visual data: Frames from real-world driving videos
Language data: Natural language descriptions of the driving environment and maneuvers, produced by a caption generation pipeline
Action data: Driving trajectories derived from the in-vehicle sensor data through automated processing
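To make the three modalities concrete, here is a minimal Python sketch of what a single vision-language-action record could look like. The class name, field names, file path, and caption text are all hypothetical and chosen for illustration; they are not the dataset's actual schema. The action is represented as trajectory waypoints, following the paper's abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DrivingSample:
    """One hypothetical vision-language-action record (illustrative fields only)."""
    frame_path: str                        # visual data: path to a video frame
    caption: str                           # language data: scene and maneuver description
    trajectory: List[Tuple[float, float]]  # action data: future (x, y) waypoints in metres

    def is_valid(self) -> bool:
        """A sample is usable only if all three modalities are present."""
        return (
            bool(self.frame_path)
            and bool(self.caption.strip())
            and len(self.trajectory) > 0
        )


# Hypothetical example values, not taken from the dataset.
sample = DrivingSample(
    frame_path="frames/000001.png",
    caption="The ego vehicle is driving straight on a two-lane road.",
    trajectory=[(0.0, 0.0), (1.0, 0.0), (2.0, 0.1)],
)
print(sample.is_valid())
```

Keeping the three modalities in one record makes it easy to filter out incomplete samples before training a model that consumes all of them.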

The dataset spans a variety of traffic scenarios, weather conditions, and road types, providing a comprehensive testbed for evaluating vision-language-action models for autonomous driving.

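Rather than relying on manual annotation, the researchers use a scalable approach based on automated data processing and a caption generation pipeline, which pairs each driving trajectory with a detailed natural language description. They then use the dataset to train and evaluate a multi-modal large language model that handles vision, language, and action, and report that it generates coherent language and action outputs across a variety of driving scenarios.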
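As a rough illustration of how automated processing can turn raw in-vehicle signals into a trajectory label, the sketch below integrates speed and yaw rate into waypoints (simple dead reckoning). This is an assumed, simplified technique for illustration and not necessarily the method used in the paper; the function name, parameters, and sample values are hypothetical.

```python
import math
from typing import List, Tuple


def dead_reckon(
    speeds: List[float], yaw_rates: List[float], dt: float
) -> List[Tuple[float, float]]:
    """Integrate speed (m/s) and yaw rate (rad/s) into (x, y) waypoints.

    Waypoints are expressed in the frame of the starting pose, with x
    pointing forward and y to the left. Assumes a fixed time step dt (s).
    """
    if len(speeds) != len(yaw_rates):
        raise ValueError("speeds and yaw_rates must have the same length")
    x, y, heading = 0.0, 0.0, 0.0
    waypoints = [(x, y)]
    for v, w in zip(speeds, yaw_rates):
        # Advance along the current heading, then update the heading.
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        heading += w * dt
        waypoints.append((x, y))
    return waypoints


# Hypothetical signals: constant 10 m/s, no turning, sampled every 0.1 s.
straight = dead_reckon(speeds=[10.0] * 5, yaw_rates=[0.0] * 5, dt=0.1)
print(straight[-1])
```

Pure integration like this accumulates drift over time, which is why real pipelines typically fuse several sensors rather than relying on speed and yaw rate alone.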

Critical Analysis

The CoVLA dataset represents an important step forward in providing a comprehensive resource for developing advanced autonomous driving systems. By incorporating language and action data in addition to visual information, it enables the exploration of more holistic vision-language-action models that can better understand and navigate complex driving scenarios.

However, some limitations are worth keeping in mind. Automatically generated language descriptions may carry bias or errors, and scaling data collection to cover the full diversity of real-world driving remains a challenge. Evaluating models trained on CoVLA also needs further research before their performance can be considered robust and reliable.

Conclusion

The CoVLA dataset is a valuable contribution to self-driving car research. By combining visual, textual, and action-based information in one multi-modal dataset, it opens up new avenues for developing vision-language-action models that extend the capabilities of autonomous driving. As researchers continue to explore and improve these models, CoVLA can serve as a platform for training and evaluating them, supporting progress towards safer and more capable self-driving vehicles.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto

Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purpose.

8/21/2024

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

6/26/2024

CarLLaVA: Vision language models for camera-only closed-loop driving

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hunermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski

In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0 outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.

6/17/2024

OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, Wenchao Ding

The rise of multi-modal large language models(MLLMs) has spurred their applications in autonomous driving. Recent MLLM-based methods perform action by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess world model that enables them to simulate the future states based on 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action(VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering its sparsity and classes imbalance. Then, we build a unified multi-modal vocabulary for vision, language and action. Furthermore, we enhance LLM, specifically LLaMA, to perform the next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.

9/6/2024