VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models

Read original: arXiv:2404.00210 - Published 7/9/2024 by Daeun Song, Jing Liang, Amirreza Payandeh, Xuesu Xiao, Dinesh Manocha


Overview

• This paper explores a vision-language model approach to socially aware robot navigation, where the robot uses a vision-language model to score potential navigation paths based on their social awareness.

• The key idea is to use a pre-trained vision-language model to assess how "socially aware" a given navigation path would be, and then use that assessment to guide the robot's navigation decisions.

Plain English Explanation

• Robots need to be able to navigate through social environments, like crowded areas with people, in a way that is considerate and doesn't disrupt or inconvenience others. This is called "socially aware navigation."

• The researchers in this paper developed a system that uses a special type of artificial intelligence model called a "vision-language model" to help a robot evaluate different possible navigation paths and choose the one that is the most socially aware.

• Vision-language models are AI systems that can understand the relationship between visual information (like images) and language. In this case, the researchers used a pre-trained vision-language model to assess how "socially appropriate" different navigation paths would be, based on factors like how close the robot would come to people, whether it would block their path, etc.

• By using this vision-language model scoring approach, the robot can choose navigation paths that are more considerate and less disruptive to the people around it, compared to a robot that just tries to find the fastest or shortest path without considering the social implications.

• This could be very useful for robots operating in crowded, public spaces, like malls, airports, or city streets, where being socially aware is important for the robot to be able to navigate safely and without bothering people.

Technical Explanation

• The core of the system is a pre-trained vision-language model that is used to score the "social awareness" of potential navigation paths.

• The robot first generates a set of candidate navigation paths with an underlying planner. The abstract does not name the planner; this summary points to related systems such as DriveVLM and A3VLM.

• For each candidate path, a perception model detects the people and other socially relevant entities in view, and the pre-trained vision-language model is prompted to produce a "social awareness" score. This score reflects how considerate and appropriate the path is according to the model's understanding of social norms.

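• The robot then selects the navigation path with the highest social awareness score. In the paper's terms, the score enters the underlying planner as a cost term. This summary also points to techniques like Learning Early Social Maneuvers for refining the path if needed.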

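• The experiments in the paper show that this scoring approach leads to more socially aware navigation than path planning methods that don't explicitly consider social factors. The authors report at least a 27.38% improvement in average success rate and a 19.05% improvement in average collision rate across four social navigation scenarios, and their user study rates VLM-Social-Nav as producing the most socially compliant behavior.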
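The generate, score, and select steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` type, the grid sampler, the `goal_cost` and `social_cost` formulas, the `social_weight` parameter, and the `stub_vlm_guidance` function (a hard-coded stand-in for a real VLM call) are all hypothetical names introduced here. The sketch also assumes that negative angular velocity turns the robot to the right.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A candidate action: linear velocity (m/s) and angular velocity (rad/s)."""
    linear: float
    angular: float


def sample_candidates(max_linear: float = 0.5, max_angular: float = 1.0,
                      steps: int = 5) -> List[Candidate]:
    """Sample a uniform grid of (linear, angular) velocity pairs."""
    candidates = []
    for i in range(steps):
        v = max_linear * i / (steps - 1)
        for j in range(steps):
            w = -max_angular + 2 * max_angular * j / (steps - 1)
            candidates.append(Candidate(v, w))
    return candidates


def goal_cost(c: Candidate, goal_heading: float) -> float:
    """Assumed planner cost: prefer turning toward the goal and moving fast."""
    return abs(goal_heading - c.angular) - c.linear


def stub_vlm_guidance(scene_description: str) -> Candidate:
    """Hypothetical stand-in for a VLM: maps a scene to a preferred action.

    A real system would prompt a vision-language model with camera input;
    here the guidance is hard-coded so the example runs offline.
    """
    if "person ahead" in scene_description:
        return Candidate(linear=0.25, angular=-0.5)  # slow down, steer right
    return Candidate(linear=0.5, angular=0.0)        # proceed straight


def social_cost(c: Candidate, preferred: Candidate) -> float:
    """Assumed social cost: distance from the VLM-preferred action."""
    return math.hypot(c.linear - preferred.linear, c.angular - preferred.angular)


def select_action(candidates: List[Candidate], goal_heading: float,
                  preferred: Candidate, social_weight: float) -> Candidate:
    """Pick the candidate minimizing planner cost plus weighted social cost."""
    return min(
        candidates,
        key=lambda c: goal_cost(c, goal_heading)
        + social_weight * social_cost(c, preferred),
    )


if __name__ == "__main__":
    candidates = sample_candidates()
    for scene in ("empty corridor", "person ahead in corridor"):
        preferred = stub_vlm_guidance(scene)
        action = select_action(candidates, goal_heading=0.0,
                               preferred=preferred, social_weight=5.0)
        print(scene, "->", action)
```

With the social term weighted heavily, the selected action follows the stubbed guidance when a person is detected and falls back to driving straight at full speed otherwise. Setting `social_weight` to zero recovers a planner that ignores social factors, which is the baseline the summary contrasts against.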

Critical Analysis

• The paper acknowledges that the pre-trained vision-language model used may have some biases or limitations in its understanding of social norms, which could impact the robot's navigation decisions.

• Additionally, the evaluation covers four real-world social navigation scenarios with a Turtlebot robot. Validation on other robot platforms and in a wider range of environments would be needed to fully understand the system's performance and limitations.

• It would also be interesting to see how this approach could be combined with other techniques, such as the Memory Maze Scenario-Driven Benchmark, to further enhance the robot's social awareness and navigation capabilities.

Conclusion

• This paper presents a novel approach to socially aware robot navigation that leverages pre-trained vision-language models to score the social awareness of potential navigation paths.

• By considering social factors in the path planning process, this system can help robots navigate crowded, public spaces in a more considerate and less disruptive way, which could be valuable for applications in areas like service robotics, autonomous vehicles, and more.

• While the system has some limitations that require further research, this work represents an important step towards developing robots that can seamlessly and safely operate in complex social environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models

Daeun Song, Jing Liang, Amirreza Payandeh, Xuesu Xiao, Dinesh Manocha

We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term that ensures socially appropriate and effective robot actions generated by the underlying planner. Our overall approach reduces reliance on large training datasets and enhances adaptability in decision-making. In practice, it results in improved socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot. We observe at least 27.38% improvement in the average success rate and 19.05% improvement in the average collision rate in the four social navigation scenarios. Our user study score shows that VLM-Social-Nav generates the most socially compliant navigation behavior.

7/9/2024

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang

Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometers, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

5/28/2024

Enhancing Socially-Aware Robot Navigation through Bidirectional Natural Language Conversation

Congcong Wen, Yifan Liu, Geeta Chandra Raju Bethala, Zheng Peng, Hui Lin, Yu-Shen Liu, Yi Fang

Robot navigation is an important research field with applications in various domains. However, traditional approaches often prioritize efficiency and obstacle avoidance, neglecting a nuanced understanding of human behavior or intent in shared spaces. With the rise of service robots, there's an increasing emphasis on endowing robots with the capability to navigate and interact in complex real-world environments. Socially aware navigation has recently become a key research area. However, existing work either predicts pedestrian movements or simply emits alert signals to pedestrians, falling short of facilitating genuine interactions between humans and robots. In this paper, we introduce the Hybrid Soft Actor-Critic with Large Language Model (HSAC-LLM), an innovative model designed for socially-aware navigation in robots. This model seamlessly integrates deep reinforcement learning with large language models, enabling it to predict both continuous and discrete actions for navigation. Notably, HSAC-LLM facilitates bidirectional interaction based on natural language with pedestrian models. When a potential collision with pedestrians is detected, the robot can initiate or respond to communications with pedestrians, obtaining and executing subsequent avoidance strategies. Experimental results in 2D simulation, the Gazebo environment, and the real-world environment demonstrate that HSAC-LLM not only efficiently enables interaction with humans but also exhibits superior performance in navigation and obstacle avoidance compared to state-of-the-art DRL algorithms. We believe this innovative paradigm opens up new avenues for effective and socially aware human-robot interactions in dynamic environments. Videos are available at https://hsacllm.github.io/.

9/10/2024


DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

6/26/2024