Is it safe to cross? Interpretable Risk Assessment with GPT-4V for Safety-Aware Street Crossing

Read original: arXiv:2402.06794 - Published 7/9/2024 by Hochul Hwang, Sunjae Kwon, Yekyung Kim, Donghyun Kim

Overview

This paper presents an approach for assessing the risk of street crossing using GPT-4V, a large multimodal model (LMM) that accepts both images and text.
The goal is to support safe street-crossing decisions for blind and low-vision individuals by producing a safety score together with a natural-language description of the scene.
The researchers collect multiview egocentric images of crosswalk intersections with a quadruped robot, annotate them with safety scores, and evaluate how well the model predicts those scores and describes the scenes under various prompts.

Plain English Explanation

The paper focuses on a system that can help blind and low-vision pedestrians decide when it is safe to cross the street. The key idea is to use a multimodal AI model called GPT-4V to analyze the elements of a street scene, such as oncoming vehicles, the crosswalk, and the overall traffic conditions. Based on this analysis, the system produces a safety score for crossing at that moment and describes the scene in plain language, so the pedestrian has guidance on whether to proceed.

The researchers wanted to make this system interpretable, meaning that it can explain the reasoning behind its assessments in a way that humans can understand. This is important for building trust and ensuring that pedestrians feel confident relying on the system's recommendations. The paper on using multimodal large language models for automated detection discusses similar approaches for using LLMs in safety-critical applications.

Overall, the goal is to leverage advanced AI technologies like GPT-4V to create a smart assistant that can help make street crossing safer and more efficient for pedestrians. This builds on prior work in cross-modality safety alignment and safety alignment for vision-language models.

Technical Explanation

The researchers study a pipeline that takes in images of street scenes and outputs interpretable assessments of the safety risk for crossing the street. The core component is GPT-4V, a large multimodal model, which the authors evaluate with various text prompts on a dataset of crosswalk images annotated with safety scores under a predefined categorization.

In the study, the images are multiview egocentric views captured by a quadruped robot at crosswalk intersections. Visual knowledge extracted from each image is combined with a text prompt and given to GPT-4V, which produces a safety score along with a natural language description of the scene that explains the assessment.
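
The image-to-assessment flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 1 to 5 score scale, the prompt wording, the `Score: N` response format, and the `query_model` callable are all assumptions introduced here, since this summary does not reproduce the paper's safety score categorization or prompts. A stand-in function plays the role of the model so the example runs without any external service.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical scale; the paper's actual categorization is not given here.
MIN_SCORE, MAX_SCORE = 1, 5

# Hypothetical prompt wording, for illustration only.
PROMPT = (
    "You are assisting a blind or low-vision pedestrian at a crosswalk.\n"
    "Describe the scene, then rate how safe it is to cross right now.\n"
    f"End with a line 'Score: N' where N is an integer from {MIN_SCORE} "
    f"(unsafe) to {MAX_SCORE} (safe)."
)


@dataclass
class Assessment:
    score: Optional[int]  # None when no valid score could be parsed
    description: str


def parse_response(text: str) -> Assessment:
    """Split a model reply into a scene description and a validated score."""
    match = re.search(r"Score:\s*(\d+)", text)
    score = int(match.group(1)) if match else None
    if score is not None and not (MIN_SCORE <= score <= MAX_SCORE):
        score = None  # reject out-of-range values instead of trusting them
    description = text[: match.start()].strip() if match else text.strip()
    return Assessment(score, description)


def assess(image_bytes: bytes, query_model: Callable[[str, bytes], str]) -> Assessment:
    """Send the prompt and image to a model callable and parse its reply."""
    return parse_response(query_model(PROMPT, image_bytes))


def fake_model(prompt: str, image: bytes) -> str:
    """Stand-in for a multimodal model call (assumption, not a real API)."""
    return "A car is approaching from the left; the signal is red.\nScore: 1"


result = assess(b"", fake_model)
print(result.score, "-", result.description)
```

Returning `None` for a missing or out-of-range score lets the caller treat an unparseable reply as "no assessment" rather than silently accepting it, which seems the safer default for this kind of application.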

The paper on edge-assisted ML-aided uncertainty-aware vehicle discusses a related approach for using ML models at the edge to provide safety-critical assessments in real-time. And the paper on cognitive internet for vulnerable road users and traffic prediction explores how advanced AI can be used to enhance safety for pedestrians and other vulnerable road users.

Through experiments, the researchers report that GPT-4V, when activated by suitable prompts, shows useful reasoning and safety score prediction capabilities. They present this as a pathway toward a trustworthy decision-support system and as a potential advance over conventional traffic signal recognition techniques.
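
One way to score predictions against annotated safety scores is sketched below. The two metrics (exact-match accuracy and mean absolute error) are common choices assumed here for illustration; this summary does not state which metrics the paper uses, and the numbers in the example are made up rather than taken from the paper.

```python
from typing import Sequence


def _check(predicted: Sequence[int], annotated: Sequence[int]) -> None:
    """Both metrics need non-empty inputs of equal length."""
    if len(predicted) != len(annotated) or len(annotated) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def exact_match_accuracy(predicted: Sequence[int], annotated: Sequence[int]) -> float:
    """Fraction of images where the predicted score equals the annotation."""
    _check(predicted, annotated)
    hits = sum(p == a for p, a in zip(predicted, annotated))
    return hits / len(annotated)


def mean_absolute_error(predicted: Sequence[int], annotated: Sequence[int]) -> float:
    """Average distance between predicted and annotated scores."""
    _check(predicted, annotated)
    return sum(abs(p - a) for p, a in zip(predicted, annotated)) / len(annotated)


# Made-up scores for illustration only; these are not results from the paper.
annotated = [1, 3, 5, 2, 4]
predicted = [1, 2, 5, 2, 5]

acc = exact_match_accuracy(predicted, annotated)
mae = mean_absolute_error(predicted, annotated)
print(f"accuracy={acc:.2f} mae={mae:.2f}")
```

Mean absolute error is included because, on an ordered scale, predicting a 4 for an annotated 5 is a smaller mistake than predicting a 1, and exact-match accuracy alone would hide that difference.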

Critical Analysis

One potential limitation of the research is that the evaluation relies on a collected dataset of crosswalk images, and it's unclear how well the approach would perform across the full range of real-world, unpredictable street crossing scenarios. Further testing and refinement would be needed before deploying such a system in the field.

Additionally, there are concerns around the potential for bias and errors in the GPT-4V model's assessments, especially when dealing with complex, ambiguous situations. The researchers mention the importance of transparency and explainability, but more work may be needed to ensure the system's decision-making process is truly interpretable and trustworthy for end-users.

Another area for further research is exploring how this type of safety-aware street crossing system could be integrated with other smart city technologies, such as edge-assisted ML-aided uncertainty-aware vehicle systems or cognitive internet applications for vulnerable road users. Combining multiple data sources and AI models could lead to even more robust and comprehensive safety assessments.

Conclusion

This paper presents a promising approach for leveraging large multimodal models like GPT-4V to enhance pedestrian safety at street crossings. By providing interpretable safety scores and scene descriptions, the approach has the potential to help blind and low-vision pedestrians make more informed crossing decisions.

As AI and computer vision technologies continue to advance, solutions like the one described in this paper could play a critical role in creating safer, more livable cities for all. However, further research and testing will be needed to ensure the reliability, fairness, and trustworthiness of such systems before they can be widely deployed.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Is it safe to cross? Interpretable Risk Assessment with GPT-4V for Safety-Aware Street Crossing

Hochul Hwang, Sunjae Kwon, Yekyung Kim, Donghyun Kim

Safely navigating street intersections is a complex challenge for blind and low-vision individuals, as it requires a nuanced understanding of the surrounding context - a task heavily reliant on visual cues. Traditional methods for assisting in this decision-making process often fall short, lacking the ability to provide a comprehensive scene analysis and safety level. This paper introduces an innovative approach that leverages large multimodal models (LMMs) to interpret complex street crossing scenes, offering a potential advancement over conventional traffic signal recognition techniques. By generating a safety score and scene description in natural language, our method supports safe decision-making for blind and low-vision individuals. We collected crosswalk intersection data that contains multiview egocentric images captured by a quadruped robot and annotated the images with corresponding safety scores based on our predefined safety score categorization. Grounded on visual knowledge extracted from the images and on a text prompt, we evaluate a large multimodal model for safety score prediction and scene description. Our findings highlight the reasoning and safety score prediction capabilities of an LMM, activated by various prompts, as a pathway to developing a trustworthy system, crucial for applications requiring reliable decision-making support.

7/9/2024

Revolutionizing Urban Safety Perception Assessments: Integrating Multimodal Large Language Models with Street View Images

Jiaxin Zhang, Yunqin Li, Tomohiro Fukuda, Bowen Wang

Measuring urban safety perception is an important and complex task that traditionally relies heavily on human resources. This process often involves extensive field surveys, manual data collection, and subjective assessments, which can be time-consuming, costly, and sometimes inconsistent. Street View Images (SVIs), along with deep learning methods, provide a way to realize large-scale urban safety detection. However, achieving this goal often requires extensive human annotation to train safety ranking models, and the architectural differences between cities hinder the transferability of these models. Thus, a fully automated method for conducting safety evaluations is essential. Recent advances in multimodal large language models (MLLMs) have demonstrated powerful reasoning and analytical capabilities. Cutting-edge models, e.g., GPT-4, have shown surprising performance in many tasks. We employed these models for urban safety ranking on a human-annotated anchor set and validated that the results from MLLMs align closely with human perceptions. Additionally, we proposed a method based on pre-trained Contrastive Language-Image Pre-training (CLIP) features and K-Nearest Neighbors (K-NN) retrieval to quickly assess the safety index of the entire city. Experimental results show that our method outperforms existing deep learning approaches that require training, achieving efficient and accurate urban safety evaluations. The proposed automation for urban safety perception assessment is a valuable tool for city planners, policymakers, and researchers aiming to improve urban environments.

7/30/2024

Cross-Modality Safety Alignment

Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang

As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.

6/24/2024

Safety Alignment for Vision Language Models

Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng

Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the existing VLMs' visual modality safety alignment by adding safety modules, including a safety projector, safety tokens, and a safety head, through a two-stage training process, effectively improving the model's defense against risky images. For example, building upon the LLaVA-v1.5 model, we achieve a safety score of 8.26, surpassing GPT-4V on the Red Teaming Visual Language Models (RTVLM) benchmark. Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance. Moreover, our alignment strategy also uncovers some possible risky content within commonly used open-source multimodal datasets. Our code will be open sourced after the anonymous review.

5/24/2024