ViPlanner: Visual Semantic Imperative Learning for Local Navigation

arXiv:2310.00982

Published 5/24/2024 by Pascal Roth, Julian Nubert, Fan Yang, Mayank Mittal, Marco Hutter

Abstract

Real-time path planning in outdoor environments still challenges modern robotic systems due to differences in terrain traversability, diverse obstacles, and the necessity for fast decision-making. Established approaches have primarily focused on geometric navigation solutions, which work well for structured geometric obstacles but have limitations regarding the semantic interpretation of different terrain types and their affordances. Moreover, these methods fail to identify traversable geometric occurrences, such as stairs. To overcome these issues, we introduce ViPlanner, a learned local path planning approach that generates local plans based on geometric and semantic information. The system is trained using the Imperative Learning paradigm, for which the network weights are optimized end-to-end based on the planning task objective. This optimization uses a differentiable formulation of a semantic costmap, which enables the planner to distinguish between the traversability of different terrains and accurately identify obstacles. The semantic information is represented in 30 classes using an RGB colorspace that can effectively encode the multiple levels of traversability. We show that the planner can adapt to diverse real-world environments without requiring any real-world training. In fact, the planner is trained purely in simulation, enabling a highly scalable training data generation. Experimental results demonstrate resistance to noise, zero-shot sim-to-real transfer, and a decrease of 38.02% in terms of traversability cost compared to purely geometric-based approaches. Code and models are made publicly available: https://github.com/leggedrobotics/viplanner.

Overview

This paper introduces ViPlanner, a learned local path planning approach that generates plans based on geometric and semantic information.
ViPlanner uses the Imperative Learning paradigm to optimize network weights end-to-end for the planning task objective.
The system leverages a differentiable formulation of a semantic costmap to distinguish between the traversability of different terrains and identify obstacles.
ViPlanner is trained purely in simulation, enabling scalable training data generation, and demonstrates zero-shot sim-to-real transfer.

Plain English Explanation

ViPlanner is a new approach to path planning for robots operating in outdoor environments. Traditional path planning methods have focused on geometric solutions, which work well for structured obstacles but struggle with interpreting the traversability of different terrain types. These methods also have difficulty identifying traversable features like stairs.

To address these limitations, ViPlanner uses machine learning to combine geometric and semantic information when generating local plans. The system is trained using a technique called Imperative Learning, which optimizes the neural network directly for the path planning task. This allows ViPlanner to learn to accurately assess the traversability of different terrains, such as grass, dirt, or stairs, and plan paths that avoid obstacles while prioritizing the most traversable routes.

Importantly, ViPlanner is trained entirely in simulation, which enables the generation of diverse training data without the need for extensive real-world data collection. Despite this, the system demonstrates the ability to transfer its capabilities to real-world environments without any additional training, a property known as "zero-shot sim-to-real transfer."

ViPlanner represents an important step forward in making robots more capable of navigating complex outdoor environments, as it allows them to better understand and plan around the different types of terrain they may encounter.

Technical Explanation

The key innovation of ViPlanner is its use of a differentiable semantic costmap to guide the path planning process. The semantic information behind this costmap is represented in 30 classes using an RGB colorspace, which lets the system encode multiple levels of traversability rather than a binary free/occupied distinction.
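As a rough illustration of how a class-to-color encoding can carry traversability levels, the sketch below maps a handful of semantic classes to RGB colors and scalar costs. The class names, colors, cost values, and function names are hypothetical placeholders chosen for this example; they are not the 30 classes or the values used by ViPlanner.

```python
# Hypothetical subset of a semantic palette: class name -> (RGB color, cost).
# Names, colors, and costs are illustrative assumptions, not ViPlanner's.
SEMANTIC_CLASSES = {
    "sidewalk": ((0, 255, 0), 0.0),
    "stairs": ((0, 200, 0), 0.1),
    "grass": ((255, 255, 0), 0.4),
    "road": ((255, 128, 0), 0.7),
    "wall": ((255, 0, 0), 1.0),
}

# Reverse lookup so a pixel color from a semantic image yields its cost.
COLOR_TO_COST = {rgb: cost for rgb, cost in SEMANTIC_CLASSES.values()}


def pixel_cost(rgb, unknown_cost=1.0):
    """Return the cost for one RGB pixel; unknown colors get unknown_cost."""
    return COLOR_TO_COST.get(tuple(rgb), unknown_cost)


def image_cost(image):
    """Convert a semantic image (rows of RGB triples) into a grid of costs."""
    return [[pixel_cost(px) for px in row] for row in image]
```

Calling `image_cost` on a small semantic image returns a grid of per-pixel costs with the same shape. Treating unknown colors as the maximum cost is a conservative choice made for this sketch only.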

The ViPlanner architecture is trained end-to-end using the Imperative Learning paradigm, which optimizes the network weights directly for the planning task objective. This enables the system to learn to accurately assess the traversability of different terrains and plan paths that minimize the overall cost.
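To make "differentiable" concrete, here is a minimal sketch, assuming the costmap is a 2D grid queried with bilinear interpolation. Because the interpolated cost has a gradient with respect to the query position, a path cost can be pushed downhill with gradient steps. In an end-to-end setup that gradient would continue back into the network weights; this dependency-free example applies it directly to a single waypoint instead. The grid layout, function names, and step size are assumptions for illustration and do not reproduce ViPlanner's implementation.

```python
import math


def bilinear_cost(grid, x, y):
    """Interpolate the cost at continuous (x, y).

    grid[row][col] is the cost at y=row, x=col (needs at least 2x2 cells).
    Returns (cost, dcost/dx, dcost/dy).
    """
    max_x = len(grid[0]) - 1
    max_y = len(grid) - 1
    # Clamp the query to the map so the lookup is always defined.
    x = min(max(x, 0.0), float(max_x))
    y = min(max(y, 0.0), float(max_y))
    # Lower-left corner of the cell; capped so x0 + 1 and y0 + 1 stay valid.
    x0 = min(int(math.floor(x)), max_x - 1)
    y0 = min(int(math.floor(y)), max_y - 1)
    tx, ty = x - x0, y - y0
    c00, c10 = grid[y0][x0], grid[y0][x0 + 1]
    c01, c11 = grid[y0 + 1][x0], grid[y0 + 1][x0 + 1]
    cost = (c00 * (1 - tx) * (1 - ty) + c10 * tx * (1 - ty)
            + c01 * (1 - tx) * ty + c11 * tx * ty)
    # Analytic partial derivatives of the bilinear form.
    d_dx = (c10 - c00) * (1 - ty) + (c11 - c01) * ty
    d_dy = (c01 - c00) * (1 - tx) + (c11 - c10) * tx
    return cost, d_dx, d_dy


def refine_waypoint(grid, x, y, step=0.5, iters=20):
    """Move one waypoint downhill on the interpolated cost surface."""
    max_x = float(len(grid[0]) - 1)
    max_y = float(len(grid) - 1)
    for _ in range(iters):
        _, gx, gy = bilinear_cost(grid, x, y)
        # Gradient step, kept inside the map bounds.
        x = min(max(x - step * gx, 0.0), max_x)
        y = min(max(y - step * gy, 0.0), max_y)
    return x, y
```

On a grid whose cost rises from left to right, `refine_waypoint` slides the waypoint toward the low-cost left edge. A full planner would combine this traversability term with goal-reaching and smoothness terms, which are omitted here.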

During training, ViPlanner is exposed to a diverse range of simulated environments, allowing it to learn robust representations of terrain traversability without the need for extensive real-world data collection. The authors demonstrate that this approach results in a 38.02% decrease in traversability cost compared to purely geometric-based planning methods, as well as the ability to transfer the system's capabilities to real-world environments without any additional training.
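The summary does not spell out how the 38.02% figure is computed. A natural reading, assumed here, is a relative reduction of a traversability cost against a geometric baseline. The helper below shows that arithmetic; the function name and the input numbers are hypothetical and were picked only so the result matches the reported percentage.

```python
def relative_decrease(baseline_cost, new_cost):
    """Percentage by which new_cost is lower than baseline_cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 100.0 * (baseline_cost - new_cost) / baseline_cost


# Hypothetical costs, chosen only to illustrate the calculation.
geometric_cost = 1.0
semantic_cost = 0.6198
decrease = relative_decrease(geometric_cost, semantic_cost)  # about 38.02
```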

The ViPlanner system builds on previous work in areas such as visual navigation, semantic mapping, and robot-agnostic visual servoing, demonstrating how advancements in these fields can be leveraged to create more robust and capable navigation systems.

Critical Analysis

One potential limitation of ViPlanner is that it is primarily focused on local path planning, rather than global planning over longer distances. While the authors demonstrate the system's ability to navigate in complex outdoor environments, it may not be well-suited for tasks that require more extensive planning and decision-making.

Additionally, the abstract states that ViPlanner is trained using a differentiable formulation of the semantic costmap, but it does not describe how this formulation is implemented or how it compares to other representations of terrain traversability. A closer look at the specific characteristics and trade-offs of this approach would be valuable.

Future research in this area could also investigate ways to combine ViPlanner's local planning capabilities with more global planning methods, potentially using techniques like hierarchical planning or semantic mapping, to create a more comprehensive navigation system.

Conclusion

ViPlanner represents a significant advancement in the field of robotic path planning, demonstrating the potential for machine learning-based approaches to overcome the limitations of traditional geometric navigation solutions. By leveraging semantic information and a differentiable costmap, ViPlanner is able to plan paths that accurately account for the traversability of different terrains, even in complex outdoor environments.

The system's ability to transfer its capabilities from simulation to the real world without any additional training is particularly impressive and highlights the potential for scalable, data-driven approaches to navigation planning. As robots continue to play an increasingly important role in a wide range of applications, innovations like ViPlanner will be essential for enabling them to operate effectively in challenging, unstructured environments.

Related Papers

Wild Visual Navigation: Fast Traversability Learning via Pre-Trained Models and Online Self-Supervision

Matías Mattamala, Jonas Frey, Piotr Libera, Nived Chebrolu, Georg Martius, Cesar Cadena, Marco Hutter, Maurice Fallon

Natural environments such as forests and grasslands are challenging for robotic navigation because of the false perception of rigid obstacles from high grass, twigs, or bushes. In this work, we present Wild Visual Navigation (WVN), an online self-supervised learning system for visual traversability estimation. The system is able to continuously adapt from a short human demonstration in the field, only using onboard sensing and computing. One of the key ideas to achieve this is the use of high-dimensional features from pre-trained self-supervised models, which implicitly encode semantic information that massively simplifies the learning task. Further, the development of an online scheme for supervision generation enables concurrent training and inference of the learned model in the wild. We demonstrate our approach through diverse real-world deployments in forests, parks, and grasslands. Our system is able to bootstrap the traversable terrain segmentation in less than 5 min of in-field training time, enabling the robot to navigate in complex, previously unseen outdoor terrains. Code: https://bit.ly/498b0CV - Project page: https://bit.ly/3M6nMHH

4/11/2024

cs.RO cs.CV cs.LG

Learning Semantic Traversability with Egocentric Video and Automated Annotation Strategy

Yunho Kim, Jeong Hyun Lee, Choongin Lee, Juhyeok Mun, Donghoon Youm, Jeongsoo Park, Jemin Hwangbo

For reliable autonomous robot navigation in urban settings, the robot must have the ability to identify semantically traversable terrains in the image based on the semantic understanding of the scene. This reasoning ability is based on semantic traversability, which is frequently achieved using semantic segmentation models fine-tuned on the testing domain. This fine-tuning process often involves manual data collection with the target robot and annotation by human labelers which is prohibitively expensive and unscalable. In this work, we present an effective methodology for training a semantic traversability estimator using egocentric videos and an automated annotation process. Egocentric videos are collected from a camera mounted on a pedestrian's chest. The dataset for training the semantic traversability estimator is then automatically generated by extracting semantically traversable regions in each video frame using a recent foundation model in image segmentation and its prompting technique. Extensive experiments with videos taken across several countries and cities, covering diverse urban scenarios, demonstrate the high scalability and generalizability of the proposed annotation method. Furthermore, performance analysis and real-world deployment for autonomous robot navigation showcase that the trained semantic traversability estimator is highly accurate, able to handle diverse camera viewpoints, computationally light, and real-world applicable. The summary video is available at https://youtu.be/EUVoH-wA-lA.

6/6/2024

cs.RO cs.AI

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang

Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometers, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

5/28/2024

cs.CV cs.RO

VIEW: Visual Imitation Learning with Waypoints

Ananth Jonnavittula, Sagar Parekh, Dylan P. Losey

Robots can use Visual Imitation Learning (VIL) to learn everyday tasks from video demonstrations. However, translating visual observations into actionable robot policies is challenging due to the high-dimensional nature of video data. This challenge is further exacerbated by the morphological differences between humans and robots, especially when the video demonstrations feature humans performing tasks. To address these problems we introduce Visual Imitation lEarning with Waypoints (VIEW), an algorithm that significantly enhances the sample efficiency of human-to-robot VIL. VIEW achieves this efficiency using a multi-pronged approach: extracting a condensed prior trajectory that captures the demonstrator's intent, employing an agent-agnostic reward function for feedback on the robot's actions, and utilizing an exploration algorithm that efficiently samples around waypoints in the extracted trajectory. VIEW also segments the human trajectory into grasp and task phases to further accelerate learning efficiency. Through comprehensive simulations and real-world experiments, VIEW demonstrates improved performance compared to current state-of-the-art VIL methods. VIEW enables robots to learn a diverse range of manipulation tasks involving multiple objects from arbitrarily long video demonstrations. Additionally, it can learn standard manipulation tasks such as pushing or moving objects from a single video demonstration in under 30 minutes, with fewer than 20 real-world rollouts. Code and videos here: https://collab.me.vt.edu/view/

4/30/2024

cs.RO