Lane Change Classification and Prediction with Action Recognition Networks

arXiv:2208.11650

Published 4/11/2024 by Kai Liang, Jun Wang, Abhir Bhalerao

Abstract

Anticipating lane change intentions of surrounding vehicles is crucial for efficient and safe driving decision making in an autonomous driving system. Previous works often adopt physical variables such as driving speed, acceleration and so forth for lane change classification. However, physical variables do not contain semantic information. Although 3D CNNs have been developing rapidly, the number of methods utilising action recognition models and appearance feature for lane change recognition is low, and they all require additional information to pre-process data. In this work, we propose an end-to-end framework including two action recognition methods for lane change recognition, using video data collected by cameras. Our method achieves the best lane change classification results using only the RGB video data of the PREVENTION dataset. Class activation maps demonstrate that action recognition models can efficiently extract lane change motions. A method to better extract motion clues is also proposed in this paper.

Overview

Anticipating the intentions of surrounding vehicles is crucial for safe and efficient autonomous driving.
Previous approaches have relied on physical variables like speed and acceleration, but these lack semantic information.
While 3D convolutional neural networks (CNNs) have advanced, few methods have used action recognition models and appearance features for lane change recognition.
This paper proposes an end-to-end framework with two action recognition methods for lane change recognition using video data from cameras.

Plain English Explanation

Autonomous driving systems need to be able to predict when other vehicles around them are going to change lanes. Previous research has often used physical measurements like speed and acceleration to try to figure this out. But these physical variables don't contain enough information about the meaning and context of what the other vehicles are doing.

Some researchers have started using 3D convolutional neural networks to analyze video footage and recognize lane changes. But not many have tried using "action recognition" models, which are designed to detect specific actions and movements in videos. And the ones that have tried this approach still needed additional information to prepare the data before running their models.

This paper proposes a new end-to-end system that uses two different action recognition methods to detect lane changes just from the regular video footage captured by cameras on the autonomous vehicle. The researchers show that this approach can accurately classify lane changes using only the video data, without needing any extra preprocessing. The paper also introduces a new technique to better extract clues about the motion of the other vehicles.

Technical Explanation

The proposed end-to-end framework includes two action recognition methods for lane change detection using only video data. The first method is based on ActNetFormer, a hybrid model that combines a transformer-based architecture with a ResNet backbone. The second method uses a more lightweight ENet-21 CNN structure.

Both models are trained on the PREVENTION dataset, which contains video footage of lane changes. The paper reports that the action recognition-based approaches outperform previous methods that relied on physical variables. Class activation maps indicate that the models extract the key motions and visual cues associated with lane changes.
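To make the class activation map idea concrete, here is a minimal sketch of the classic CAM computation extended to video features. It assumes a network whose last convolutional layer is followed by global average pooling and a single linear classifier; the summary does not say which CAM variant the authors used, so the function name, the `[C][T][H][W]` layout, and the toy numbers are illustrative assumptions rather than the paper's implementation.

```python
def class_activation_map(features, class_weights):
    """Weighted sum of feature maps for one class (hypothetical helper).

    features: nested list shaped [C][T][H][W] from the last conv layer.
    class_weights: C classifier weights for the class of interest.
    Returns a [T][H][W] map scaled to [0, 1].
    """
    channels = len(features)
    frames = len(features[0])
    height = len(features[0][0])
    width = len(features[0][0][0])
    if len(class_weights) != channels:
        raise ValueError("one weight per channel is required")

    # Weighted sum over channels at every (t, h, w) location.
    cam = [[[sum(class_weights[c] * features[c][t][h][w]
                 for c in range(channels))
             for w in range(width)]
            for h in range(height)]
           for t in range(frames)]

    # Keep only positive evidence, then scale by the peak for display.
    peak = max(max(max(row) for row in frame) for frame in cam)
    if peak <= 0:
        return [[[0.0] * width for _ in range(height)] for _ in range(frames)]
    return [[[max(v, 0.0) / peak for v in row] for row in frame]
            for frame in cam]


# Toy input: 2 channels, 1 frame, 2x2 spatial grid (made-up values).
toy_features = [
    [[[1.0, 0.0], [0.0, 0.0]]],  # channel 0 fires top-left
    [[[0.0, 0.0], [0.0, 2.0]]],  # channel 1 fires bottom-right
]
toy_weights = [0.5, 1.0]
toy_cam = class_activation_map(toy_features, toy_weights)
```

In practice the resulting map would be upsampled to the frame size and overlaid on each frame, so that high values mark the regions (for example, a vehicle drifting towards a lane marking) that contributed most to the predicted class.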

Additionally, the researchers propose a new technique to better capture motion information from the video data, further improving the lane change classification performance.
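The summary does not describe this technique, so the following is not the paper's method. Purely as an illustration of what a motion cue extracted from video can look like, the sketch below uses plain frame differencing, a generic and much simpler stand-in. The function names, the grayscale list-of-rows frame format, and the `threshold` parameter are all assumptions made for the example.

```python
def frame_difference(prev_frame, next_frame, threshold=0):
    """Binary motion mask from two grayscale frames (lists of pixel rows).

    A pixel is marked 1 when its intensity changed by more than `threshold`.
    """
    return [[1 if abs(b - a) > threshold else 0
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(prev_frame, next_frame)]


def motion_masks(frames, threshold=0):
    """Apply frame differencing to every consecutive pair in a clip."""
    return [frame_difference(a, b, threshold)
            for a, b in zip(frames, frames[1:])]


# Toy clip: three 2x2 grayscale frames with made-up intensities.
toy_clip = [
    [[0, 0], [0, 0]],
    [[0, 10], [0, 0]],
    [[0, 10], [30, 0]],
]
toy_masks = motion_masks(toy_clip, threshold=5)
```

A clip of N frames yields N - 1 masks. Masks like these highlight where something moved between frames, which is the kind of signal a motion-focused input representation aims to emphasise for the network.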

Critical Analysis

The paper makes a compelling case for using action recognition models for lane change prediction in autonomous driving. By focusing on the visual and motion patterns of lane changes, rather than only physical variables, the proposed methods achieve strong classification results using only video data.

However, the paper does not deeply address potential limitations or drawbacks of this approach. For example, it's unclear how the models would perform in more complex driving scenarios with multiple lane changes, or how they would handle unusual or unexpected driving behaviors. Additionally, the reliance on camera data means the system could be vulnerable to environmental conditions that impact visibility.

Further research is needed to better understand the robustness and generalizability of action recognition-based lane change detection in real-world autonomous driving applications. Integrating these techniques with other environmental perception methods could also lead to more comprehensive and reliable lane change prediction capabilities.

Conclusion

This paper presents a novel approach to anticipating lane change intentions of surrounding vehicles in autonomous driving. By leveraging action recognition models trained on video data, the proposed end-to-end framework demonstrates improved lane change classification performance compared to traditional methods.

The ability to accurately predict lane changes using only camera footage, without relying on additional sensor data or complex preprocessing, is a promising step towards more efficient and safer autonomous driving systems. If further developed and validated, this technique could help autonomous vehicles better navigate complex traffic situations and make more informed, proactive decisions to enhance overall driving performance.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper for details.

Related Papers

A Survey on Backbones for Deep Video Action Recognition

Zixuan Tang, Youjun Zhao, Yuhang Wen, Mengyuan Liu

Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, methods in action recognition have also achieved great advancement. Researchers design and implement the backbones referring to multiple standpoints, which leads to the diversity of methods and encountering new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-Streams networks and their variants, which, specifically in this paper, use RGB video frame and optical flow modality as input; 2) 3D convolutional networks, which make efforts in taking advantage of RGB modality directly while extracting different motion information is no longer necessary; 3) Transformer-based methods, which introduce the model from natural language processing into computer vision and video understanding. We offer objective sights in this review and hopefully provide a reference for future research.

5/10/2024

cs.CV cs.AI

DeepLocalization: Using change point detection for Temporal Action Localization

Mohammed Shaiqur Rahman, Ibne Farabi Shihab, Lynna Chu, Anuj Sharma

In this study, we introduce DeepLocalization, an innovative framework devised for the real-time localization of actions tailored explicitly for monitoring driver behavior. Utilizing the power of advanced deep learning methodologies, our objective is to tackle the critical issue of distracted driving, a significant factor contributing to road accidents. Our strategy employs a dual approach: leveraging Graph-Based Change-Point Detection for pinpointing actions in time alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities. Through careful prompt engineering, we customize the Video-LLM to adeptly handle driving activities' nuances, ensuring its classification efficacy even with sparse data. Engineered to be lightweight, our framework is optimized for consumer-grade GPUs, making it vastly applicable in practical scenarios. We subjected our method to rigorous testing on the SynDD2 dataset, a complex benchmark for distracted driving behaviors, where it demonstrated commendable performance, achieving 57.5% accuracy in event classification and 51% in event detection. These outcomes underscore the substantial promise of DeepLocalization in accurately identifying diverse driver behaviors and their temporal occurrences, all within the bounds of limited computational resources.

4/19/2024

cs.CV

Driver Activity Classification Using Generalizable Representations from Vision-Language Models

Ross Greer, Mathias Viborg Andersen, Andreas Møgelmose, Mohan Trivedi

Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.

4/24/2024

cs.CV cs.AI cs.LG

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan

Human action or activity recognition in videos is a fundamental task in computer vision with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabelled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are utilised to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while VIT excels at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each of these architectures. Experimental results on standard action recognition datasets demonstrate that our approach performs better than the existing methods, achieving state-of-the-art performance with only a fraction of labeled data. The official website of this work is available at: https://github.com/rana2149/ActNetFormer.

4/10/2024

cs.CV cs.AI cs.HC cs.LG cs.MM