WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning

Read original: arXiv:2407.04281 - Published 7/8/2024 by Yiheng Li, Chongjian Ge, Chenran Li, Chenfeng Xu, Masayoshi Tomizuka, Chen Tang, Mingyu Ding, Wei Zhan

Overview

This paper introduces WOMD-Reasoning, a large-scale language annotation dataset built on the Waymo Open Motion Dataset (WOMD) that describes and reasons about interactions and intentions in driving scenarios.
The dataset contains around 3 million Q&As on real-world driving scenarios, including 409k Q&As covering various types of interactions.
The goal is to enable fine-tuning of driving-related Large Language Models (LLMs) for applications such as scene description, prediction, and planning.

Plain English Explanation

The researchers have created a new dataset called WOMD-Reasoning that aims to help AI systems understand and reason about what happens in driving scenes. It attaches around 3 million question-and-answer pairs in plain language to scenarios from the Waymo Open Motion Dataset, covering the map, how each agent is moving, how agents interact, and what they intend to do.

The key idea is that earlier language datasets mostly captured interactions between agents that are close together. Many common interactions instead arise from traffic rules and human intentions and can play out over long distances, which makes them harder for prediction and planning models to understand. WOMD-Reasoning devotes 409k Q&As to interactions, with extensive coverage of these cases. This could be useful for applications like autonomous vehicles, which need to anticipate the actions of the agents around them.

Technical Explanation

WOMD-Reasoning is a language annotation layer built on top of WOMD. Its roughly 3 million Q&As span map descriptions, motion status descriptions, and narratives and analyses of agents' interactions, behaviors, and intentions. Of these, 409k Q&As address interactions, with extensive coverage of those induced by traffic rules and human intentions, which earlier language datasets did not sufficiently cover.
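To make the Q&A structure concrete, the sketch below shows one way such records could be represented and grouped by topic. It is an illustration only: the `QARecord` class, its field names, the topic labels, and the example questions and answers are all hypothetical and are not taken from the released dataset, whose actual file format is not described in this summary.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical topic labels, loosely following the topics named in the abstract.
TOPICS = ("map", "motion_status", "interaction", "intention")


@dataclass(frozen=True)
class QARecord:
    """One question-answer pair attached to a driving scenario (assumed layout)."""
    scenario_id: str  # hypothetical ID of the underlying WOMD scenario
    topic: str        # one of TOPICS
    question: str
    answer: str

    def __post_init__(self):
        # Reject topics outside the assumed label set.
        if self.topic not in TOPICS:
            raise ValueError(f"unknown topic: {self.topic!r}")


def count_by_topic(records):
    """Count how many Q&As fall under each topic."""
    return Counter(record.topic for record in records)


# Invented examples for illustration only; not taken from the dataset.
EXAMPLE_RECORDS = [
    QARecord("scenario_0", "map", "What kind of intersection is ahead?",
             "A four-way intersection controlled by stop signs."),
    QARecord("scenario_0", "motion_status", "How is agent #1 moving?",
             "It is decelerating while heading straight."),
    QARecord("scenario_0", "interaction", "Why does agent #1 yield to agent #2?",
             "Agent #2 reached the stop sign first and has the right of way."),
    QARecord("scenario_0", "intention", "What does agent #1 intend to do?",
             "Stop, wait for agent #2 to clear, then proceed."),
]

print(count_by_topic(EXAMPLE_RECORDS))
```

Grouping by topic in this way would make it easy to select only the interaction-related Q&As, which are the subset the paper emphasizes.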

The dataset is designed for fine-tuning driving-related LLMs on tasks such as scene description, prediction, and planning. To demonstrate its usefulness, the authors incorporate interaction and intention language from WOMD-Reasoning into Multipath++, a state-of-the-art trajectory prediction model, and report improvements of 10.14% in $MR_6$ and 6.90% in $minFDE_6$.
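The paper's abstract reports its gains in terms of $MR_6$ (miss rate) and $minFDE_6$ (minimum final displacement error), both computed over six predicted trajectories per agent. The sketch below shows how such metrics are commonly computed. It is a simplified illustration, not the benchmark's official implementation: the fixed 2.0 m miss threshold is an assumption (the official WOMD metric may define misses differently), and `relative_improvement` assumes the reported percentages are relative reductions, which the abstract does not state explicitly.

```python
import math


def min_fde(predictions, ground_truth_end):
    """Smallest final-point error among the k predicted trajectories.

    predictions: list of k trajectories, each a list of (x, y) points.
    ground_truth_end: the (x, y) point where the agent actually ended up.
    """
    return min(math.dist(traj[-1], ground_truth_end) for traj in predictions)


def miss_rate(min_fdes, threshold_m=2.0):
    """Fraction of cases whose best prediction ends more than threshold_m away.

    The 2.0 m default is an assumed threshold for illustration.
    """
    return sum(error > threshold_m for error in min_fdes) / len(min_fdes)


def relative_improvement(baseline, improved):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline


# Two candidate trajectories for one agent (k = 2 here for brevity; the
# metrics in the abstract use k = 6).
candidates = [
    [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)],
    [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)],
]
best_error = min_fde(candidates, ground_truth_end=(3.0, 1.0))
print(best_error)
```

Both metrics are errors, so lower is better, and an "improvement" corresponds to a reduction in their values.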

Critical Analysis

The WOMD-Reasoning dataset is a substantial contribution to language-based understanding of driving. The abstract describes it as by far the largest Q&A dataset on real-world driving scenarios, and its focus on interactions induced by traffic rules and human intentions addresses a gap in earlier language datasets.

However, some caveats apply. The quantitative evidence in the abstract is limited to one trajectory prediction model, Multipath++, and two metrics. The benefits for LLM-based scene description and planning are presented as applications the dataset enables, not as measured results. Because the annotations are built on WOMD, the dataset also inherits that dataset's scenario coverage.

Further research could test whether the gains carry over to other prediction and planning models, evaluate LLMs fine-tuned on WOMD-Reasoning on downstream driving tasks, and examine how well the language annotations transfer to driving data from other sources.

Conclusion

The WOMD-Reasoning dataset is an important step toward language-based understanding of interactions and intentions in driving. By adding around 3 million Q&As to real-world WOMD scenarios, with 409k of them focused on interactions, the researchers have created a resource for fine-tuning driving-related LLMs, and the reported gains on Multipath++ suggest that this language can also benefit trajectory prediction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning

Yiheng Li, Chongjian Ge, Chenran Li, Chenfeng Xu, Masayoshi Tomizuka, Chen Tang, Mingyu Ding, Wei Zhan

We propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a language annotation dataset built on WOMD, with a focus on describing and reasoning about interactions and intentions in driving scenarios. Previous language datasets primarily captured interactions caused by close distances. However, interactions induced by traffic rules and human intentions, which can occur over long distances, are not yet sufficiently covered, despite being very common and more challenging for prediction or planning models to understand. Therefore, our WOMD-Reasoning focuses extensively on these interactions, providing a total of 409k Q&As for varying types of interactions. Additionally, WOMD-Reasoning presents by far the largest Q&A dataset on real-world driving scenarios, with around 3 million Q&As covering various topics of autonomous driving, from map descriptions and motion status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. This extensive textual information enables fine-tuning driving-related Large Language Models (LLMs) for a wide range of applications like scene description, prediction, planning, etc. By incorporating interaction and intention language from WOMD-Reasoning, we see significant enhancements in the performance of the state-of-the-art trajectory prediction model, Multipath++, with improvements of 10.14% in $MR_6$ and 6.90% in $minFDE_6$, proving the effectiveness of WOMD-Reasoning. We hope WOMD-Reasoning will empower LLMs in driving to offer better interaction understanding and behavioral reasoning. The dataset is available on https://waymo.com/open/download .

7/8/2024

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang

Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. The code and dataset will be released.

7/23/2024

Making Large Language Models Better Planners with Reasoning-Decision Alignment

Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, Xiaodan Liang

Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but are confronted with dataset bias and uninterpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretrain-finetune paradigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning process can enhance explainability and scene understanding. However, such a popular strategy proves to suffer from the notorious problems of misalignment between the crafted CoTs against the consequent decision-making, which remains untouched by previous LLM-based AD methods. To address this problem, we motivate an end-to-end decision-making model based on multimodality-augmented LLM, which simultaneously executes CoT reasoning and carries out planning results. Furthermore, we propose a reasoning-decision alignment constraint between the paired CoTs and planning results, imposing the correspondence between reasoning and decision-making. Moreover, we redesign the CoTs to enable the model to comprehend complex scenarios and enhance decision-making performance. We dub our proposed large language planners with reasoning-decision alignment as RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of our RDA-Driver in enhancing the performance of end-to-end AD systems. Specifically, our RDA-Driver achieves state-of-the-art planning performance on the nuScenes dataset with 0.80 L2 error and 0.32 collision rate, and also achieves leading results on challenging DriveLM-nuScenes benchmarks with 0.82 L2 error and 0.38 collision rate.

8/27/2024

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning

Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christopher Arif Setiadharma, Jingkang Yang, Ziwei Liu

Multimodal information, together with our knowledge, helps us to understand the complex and dynamic world. Large language models (LLM) and large multimodal models (LMM), however, still struggle to emulate this capability. In this paper, we present WorldQA, a video understanding dataset designed to push the boundaries of multimodal world models with three appealing properties: (1) Multimodal Inputs: The dataset comprises 1007 question-answer pairs and 303 videos, necessitating the analysis of both auditory and visual data for successful interpretation. (2) World Knowledge: We identify five essential types of world knowledge for question formulation. This approach challenges models to extend their capabilities beyond mere perception. (3) Long-Chain Reasoning: Our dataset introduces an average reasoning step of 4.45, notably surpassing other videoQA datasets. Furthermore, we introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain, thereby facilitating accurate responses to WorldQA queries. Extensive evaluations of 13 prominent LLMs and LMMs reveal that WorldRetriever, although being the most effective model, achieved only 70% of human-level performance in multiple-choice questions. This finding highlights the necessity for further advancement in the reasoning and comprehension abilities of models. Our experiments also yield several key insights. For instance, while humans tend to perform better with increased frames, current LMMs, including WorldRetriever, show diminished performance under similar conditions. We hope that WorldQA, our methodology, and these insights could contribute to the future development of multimodal world models.

5/7/2024