Driver Activity Classification Using Generalizable Representations from Vision-Language Models

Read original: arXiv:2404.14906 - Published 4/24/2024 by Ross Greer, Mathias Viborg Andersen, Andreas Møgelmose, Mohan Trivedi

Overview

The paper presents a novel approach for driver activity classification, which is crucial for road safety and applications like driver assistance systems and autonomous vehicle control.
The method uses a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives and leverage pretrained vision-language encoders to generate robust, interpretable predictions.
The approach is evaluated on the Naturalistic Driving Action Recognition Dataset, demonstrating strong performance across diverse driver activities.

Plain English Explanation

Classifying driver activities is essential for improving road safety and developing advanced driver assistance systems and self-driving cars. This paper introduces a new method that uses a neural network to analyze video footage of drivers from multiple angles. The key innovation is the use of pre-trained vision-language models to extract meaningful representations from the video frames.

Rather than learning visual features from scratch, the system draws on the rich, contrastively learned representations of these pretrained vision-language models. By fusing the information from each camera view, the researchers' Semantic Representation Late Fusion Neural Network (SRLF-Net) can accurately classify a wide range of driver behaviors, such as reaching for something in the car.

The researchers tested their method on a real-world dataset of driver activities, and the results suggest that this approach offers both high accuracy and the ability to interpret the predictions in natural language terms. This could be valuable for monitoring driver attention and behavior to improve safety and enable smooth transitions between human and autonomous vehicle control.

Technical Explanation

The paper proposes a Semantic Representation Late Fusion Neural Network (SRLF-Net) for classifying driver activities from synchronized video footage captured from multiple perspectives. The key innovation is the use of pretrained vision-language models to extract rich, generalizable representations from the video frames.

Specifically, each video frame is encoded using a contrastively-trained vision-language model, such as CLIP or ALBEF. The resulting embeddings are then fused across the multiple camera views using a late fusion approach, which generates class probability predictions for the driver's current activity.
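
To make the late fusion step concrete, the sketch below shows one common form of it: each camera view gets its own small classification head, and the per-view class probabilities are averaged. This is an illustrative assumption, not the paper's confirmed SRLF-Net architecture. The function names, view names, embedding size, and class count are all hypothetical, and random vectors stand in for the outputs of a pretrained vision-language encoder.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: weights is [num_classes][dim], bias is [num_classes], x is [dim]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def late_fusion_predict(view_embeddings, heads):
    """Average per-view class probabilities (one hypothetical late fusion scheme).

    view_embeddings: dict mapping view name -> frame embedding for that view
    heads: dict mapping view name -> (weights, bias) of that view's classifier
    """
    per_view = [
        softmax(linear(*heads[name], embedding))
        for name, embedding in view_embeddings.items()
    ]
    num_classes = len(per_view[0])
    return [sum(p[c] for p in per_view) / len(per_view) for c in range(num_classes)]


# Toy setup: all sizes and view names below are assumptions for illustration.
rng = random.Random(0)
DIM, NUM_CLASSES = 8, 4
VIEWS = ["dashboard", "rear_view", "right_side"]

# Random vectors stand in for embeddings from a frozen vision-language encoder.
view_embeddings = {v: [rng.gauss(0, 1) for _ in range(DIM)] for v in VIEWS}

# Untrained heads; in practice these would be learned on labeled driver activity data.
heads = {
    v: (
        [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_CLASSES)],
        [0.0] * NUM_CLASSES,
    )
    for v in VIEWS
}

probs = late_fusion_predict(view_embeddings, heads)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
print(probs, predicted_class)
```

Averaging probabilities keeps each view's classifier independent, so a missing or occluded camera can be dropped from the average. A learned fusion layer over concatenated per-view outputs is another plausible choice.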

By leveraging these powerful, pre-trained vision-language representations, the SRLF-Net model can achieve robust performance across a diverse set of driver behaviors, as demonstrated on the Naturalistic Driving Action Recognition Dataset. The natural language descriptors provided by the vision-language models also offer interpretability, which could be valuable for applications like driver monitoring systems and autonomous vehicle control transitions.
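
One way to picture the interpretability claim: in a shared vision-language embedding space, a frame embedding can be compared against embeddings of natural language descriptors, and the closest descriptors read as an explanation of the prediction. The sketch below assumes such a shared space exists. The descriptor texts and the three-dimensional toy vectors are made up for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_descriptors(frame_embedding, descriptor_embeddings):
    """Return (descriptor, similarity) pairs sorted from most to least similar."""
    scores = {
        text: cosine_similarity(frame_embedding, embedding)
        for text, embedding in descriptor_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical descriptors with toy embeddings; a real system would obtain
# these vectors from the text encoder of a vision-language model.
descriptor_embeddings = {
    "a driver drinking from a bottle": [0.9, 0.1, 0.0],
    "a driver texting on a phone": [0.1, 0.9, 0.1],
    "a driver reaching behind the seat": [0.0, 0.2, 0.9],
}

# Toy frame embedding standing in for the image encoder's output.
frame_embedding = [0.2, 0.8, 0.1]

ranking = rank_descriptors(frame_embedding, descriptor_embeddings)
best_text, best_score = ranking[0]
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

Because the scores are attached to readable phrases, a reviewer can check whether the top-ranked descriptor actually matches what the frame shows.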

Critical Analysis

The paper presents a promising approach for driver activity classification, but there are a few potential areas for further exploration. First, the evaluation is limited to a single dataset, so it would be valuable to assess the method's generalization across a wider range of driving conditions and environments.

Additionally, while the use of pre-trained vision-language representations offers advantages in terms of accuracy and interpretability, the paper does not provide a detailed analysis of the specific model choices and their impact on performance. Exploring alternative vision-language architectures or fine-tuning strategies could lead to further improvements.

Finally, the paper does not discuss potential limitations or ethical considerations around the use of such driver monitoring systems, such as privacy concerns or the risk of over-reliance on automation. These are important factors to consider as this technology matures and is deployed in real-world settings.

Conclusion

The presented approach leverages powerful vision-language representations to achieve robust and interpretable driver activity classification, with promising applications in driver assistance systems and autonomous vehicle control. By fusing multi-view video data with contrastively-learned semantic representations, the Semantic Representation Late Fusion Neural Network (SRLF-Net) demonstrates strong performance on a real-world dataset of diverse driving behaviors.

While further research is needed to assess the method's generalization and explore alternative architectures, this work suggests that vision-language models offer a compelling avenue for enhancing driver monitoring and safety, ultimately paving the way for more seamless human-machine collaboration on the roads.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Driver Activity Classification Using Generalizable Representations from Vision-Language Models

Ross Greer, Mathias Viborg Andersen, Andreas Møgelmose, Mohan Trivedi

Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.

4/24/2024

Automated Vehicle Driver Monitoring Dataset from Real-World Scenarios

Mohamed Sabry, Walter Morales-Alvarez, Cristina Olaverri-Monreal

From SAE Level 3 of automation onwards, drivers are allowed to engage in activities that are not directly related to driving during their travel. However, in level 3, a misunderstanding of the capabilities of the system might lead drivers to engage in secondary tasks, which could impair their ability to react to challenging traffic situations. Anticipating driver activity allows for early detection of risky behaviors, to prevent accidents. To be able to predict the driver activity, a Deep Learning network needs to be trained on a dataset. However, the use of datasets based on simulation for training and the migration to real-world data for prediction has proven to be suboptimal. Hence, this paper presents a real-world driver activity dataset, openly accessible on IEEE Dataport, which encompasses various activities that occur in autonomous driving scenarios under various illumination and weather conditions. Results from the training process showed that the dataset provides an excellent benchmark for implementing models for driver activity recognition.

8/20/2024

Co-driver: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

Ziang Guo, Zakhar Yagudin, Artem Lykov, Mikhail Konenkov, Dzmitry Tsetserukou

Recent research on Large Language Models for autonomous driving shows promise in planning and control. However, high computational demands and hallucinations still challenge accurate trajectory prediction and control signal generation. Deterministic algorithms offer reliability but lack adaptability to complex driving scenarios and struggle with context and uncertainty. To address this problem, we propose VLM-Auto, a novel autonomous driving assistant system to empower the autonomous vehicles with adjustable driving behaviors based on the understanding of road scenes. A pipeline involving the CARLA simulator and Robot Operating System 2 (ROS2) verifying the effectiveness of our system is presented, utilizing a single Nvidia 4090 24G GPU while exploiting the capacity of textual output of the Visual Language Model (VLM). Besides, we also contribute a dataset containing an image set and a corresponding prompt set for fine-tuning the VLM module of our system. In CARLA experiments, our system achieved 97.82% average precision on 5 types of labels in our dataset. In the real-world driving dataset, our system achieved 96.97% prediction accuracy in night scenes and gloomy scenes. Our VLM-Auto dataset will be released at https://github.com/ZionGo6/VLM-Auto.

10/3/2024

N-DriverMotion: Driver motion learning and prediction using an event-based camera and directly trained spiking neural networks

Hyo Jong Chung, Byungkon Kang, Yoonseok Yang

Driver motion recognition is a principal factor in ensuring the safety of driving systems. This paper presents a novel system for learning and predicting driver motions and an event-based high-resolution (1280x720) dataset, N-DriverMotion, newly collected to train on a neuromorphic vision system. The system comprises an event-based camera that generates the first high-resolution driver motion dataset representing spike inputs and efficient spiking neural networks (SNNs) that are effective in training and predicting the driver's gestures. The event dataset consists of 13 driver motion categories classified by direction (front, side), illumination (bright, moderate, dark), and participant. A novel simplified four-layer convolutional spiking neural network (CSNN) that we proposed was directly trained using the high-resolution dataset without any time-consuming preprocessing. This enables efficient adaptation to on-device SNNs for real-time inference on high-resolution event-based streams. Compared with recent gesture recognition systems adopting neural networks for vision processing, the proposed neuromorphic vision system achieves comparable accuracy, 94.04%, in recognizing driver motions with the CSNN architecture. Our proposed CSNN and the dataset can be used to develop safer and more efficient driver monitoring systems for autonomous vehicles or edge devices requiring an efficient neural network architecture.

8/27/2024