Demonstrating Agile Flight from Pixels without State Estimation

Read original: arXiv:2406.12505 - Published 6/19/2024 by Ismail Geles, Leonard Bauersfeld, Angel Romero, Jiaxu Xing, Davide Scaramuzza

Overview

This paper presents a vision-based quadrotor system that flies agilely using only onboard camera images, without explicit state estimation.
The authors demonstrate autonomous flight through a sequence of gates at speeds up to 40 km/h with accelerations up to 2g, mapping pixels directly to control commands (collective thrust and body rates).
The work suggests that agile autonomous flight is possible on standard, off-the-shelf hardware without a separate state estimation pipeline, at least in structured environments such as drone-racing tracks.

Plain English Explanation

The researchers have built a drone that flies fast through a sequence of racing gates using only the images from its onboard camera. Drones typically run a state estimation pipeline that tracks position, orientation, and velocity, and the controller acts on those estimates. Human racing pilots, by contrast, fly from a first-person-view video stream alone. The system in this paper works the same way: it maps what the camera sees directly to the commands a human pilot would send, namely collective thrust and body rates.

This matters because it removes a component that conventional autonomous drones depend on, and it does so on standard, off-the-shelf hardware. Rather than computing precise measurements of its own state, the drone reacts to where the gates appear in its camera image. The demonstration is limited to drone racing, but the authors believe the method can serve as a foundation for real-world applications in structured environments.

Technical Explanation

The key innovation in this paper is a vision-based policy that maps pixels directly to control commands, specifically collective thrust and body rates, which are the same commands human pilots use. This end-to-end approach removes the explicit state estimation module that drone control systems typically depend on.
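The paper's abstract names collective thrust and body rates as the command interface. As a rough illustration of what such an interface can look like, the sketch below scales a four-element policy output in [-1, 1] into one thrust value and three body rates. The `Command` type, the `to_command` helper, and both limits are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Command:
    """Hypothetical command container (not from the paper)."""
    collective_thrust: float                 # mass-normalized thrust, m/s^2
    body_rates: Tuple[float, float, float]   # roll, pitch, yaw rates, rad/s


# Assumed platform limits, for illustration only.
MAX_THRUST = 40.0   # m/s^2
MAX_RATE = 6.0      # rad/s


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def to_command(raw: Sequence[float]) -> Command:
    """Scale four policy outputs in [-1, 1] to a thrust and body-rate command."""
    if len(raw) != 4:
        raise ValueError("expected 4 policy outputs")
    t, p, q, r = (clamp(v, -1.0, 1.0) for v in raw)
    # Thrust is non-negative, so shift [-1, 1] to [0, MAX_THRUST].
    thrust = 0.5 * (t + 1.0) * MAX_THRUST
    # Body rates are symmetric around zero.
    return Command(thrust, (p * MAX_RATE, q * MAX_RATE, r * MAX_RATE))
```

Clamping before scaling keeps an out-of-range network output from producing a command outside the assumed limits.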

The policy is trained with reinforcement learning (RL) in simulation rather than by supervised learning on a labeled dataset. Training uses an asymmetric actor-critic: the critic has access to privileged information available in the simulator, while the actor sees only the observations it will have at deployment. To keep image-based RL computationally tractable, the authors use the inner edges of the gates as a sensor abstraction, a task-relevant representation that can be simulated without rendering images. At deployment, a Swin-transformer-based gate detector extracts this representation from the camera images.
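The abstract describes using the inner edges of the gates as a sensor abstraction that can be simulated during training without rendering images. One way to see why no rendering is needed is that the pixel locations of a gate's inner corners follow from geometry alone. The sketch below projects the corners of a square gate with a pinhole camera model. The function names, the camera intrinsics, and the assumption that the gate faces the camera are all illustrative and not taken from the paper.

```python
from typing import List, Optional, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_point(p: Point3, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Pixel]:
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = p
    if z <= 0.0:
        return None  # behind the camera, not observable
    return (fx * x / z + cx, fy * y / z + cy)


def gate_inner_corners(center: Point3, side: float) -> List[Point3]:
    """Inner corners of a square gate facing the camera (simplifying assumption)."""
    gx, gy, gz = center
    h = side / 2.0
    return [(gx - h, gy - h, gz), (gx + h, gy - h, gz),
            (gx + h, gy + h, gz), (gx - h, gy + h, gz)]


def observe_gate(center: Point3, side: float,
                 fx: float = 300.0, fy: float = 300.0,
                 cx: float = 320.0, cy: float = 240.0) -> Optional[List[Pixel]]:
    """Return the four corner pixels, or None if any corner is behind the camera."""
    pixels = [project_point(p, fx, fy, cx, cy)
              for p in gate_inner_corners(center, side)]
    if any(p is None for p in pixels):
        return None
    return pixels
```

For example, `observe_gate((0.0, 0.0, 3.0), 1.5)` places a 1.5 m gate three meters ahead of the camera and returns four pixel coordinates centered on the principal point. A simulator can produce observations like these at every step from the known gate and drone poses, which is far cheaper than rendering a full image.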

The experiments demonstrate agile flight through a sequence of gates at speeds up to 40 km/h with accelerations up to 2g, using standard, off-the-shelf hardware. To the best of the authors' knowledge, this is the first vision-based quadrotor system to navigate gates at high speed while mapping pixels directly to control commands.
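The abstract reports speeds up to 40 km/h and accelerations up to 2g. For comparison with figures quoted in SI units, the short sketch below converts both values. The helper names are hypothetical, and the conversion uses standard gravity, 9.80665 m/s².

```python
STANDARD_GRAVITY = 9.80665  # m/s^2, standard value of g


def kmh_to_ms(speed_kmh: float) -> float:
    """Convert kilometers per hour to meters per second."""
    return speed_kmh * 1000.0 / 3600.0


def g_to_ms2(accel_g: float) -> float:
    """Convert an acceleration in multiples of g to m/s^2."""
    return accel_g * STANDARD_GRAVITY


if __name__ == "__main__":
    print(f"40 km/h = {kmh_to_ms(40.0):.2f} m/s")
    print(f"2 g     = {g_to_ms2(2.0):.2f} m/s^2")
```

This gives roughly 11.1 m/s and 19.6 m/s².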

Critical Analysis

One potential limitation of this approach is its reliance on gates as the visual reference. The sensor abstraction is built on gate inner edges, so the policy depends on gates being visible and correctly detected, and environments without such structure fall outside what was demonstrated. The authors frame the work as a foundation for future research into real-world applications in structured environments. Because the policy is trained in simulation, conditions that the simulator does not capture may also degrade performance on real hardware.

Additionally, performance may be sensitive to the quality of the onboard camera images. With poor lighting, occlusions, or low-quality images, the gate detector may fail to locate the gate edges, leaving the policy without usable input. The abstract does not report performance under these conditions.

Conclusion

This paper presents a vision-based quadrotor system that achieves agile flight without explicit state estimation. By training a policy with reinforcement learning to map gate observations to collective thrust and body rates, the authors have built a system that flies through a sequence of gates at speeds up to 40 km/h using only its onboard camera.

The work points toward autonomous drones that fly the way human racing pilots do, on standard, off-the-shelf hardware. The demonstration focuses on drone racing, and further research is needed to extend the approach beyond gate-based tracks and to characterize its sensitivity to conditions that degrade the visual input.

Related Papers

Demonstrating Agile Flight from Pixels without State Estimation

Ismail Geles, Leonard Bauersfeld, Angel Romero, Jiaxu Xing, Davide Scaramuzza

Quadrotors are among the most agile flying robots. Despite recent advances in learning-based control and computer vision, autonomous drones still rely on explicit state estimation. On the other hand, human pilots only rely on a first-person-view video stream from the drone onboard camera to push the platform to its limits and fly robustly in unseen environments. To the best of our knowledge, we present the first vision-based quadrotor system that autonomously navigates through a sequence of gates at high speeds while directly mapping pixels to control commands. Like professional drone-racing pilots, our system does not use explicit state estimation and leverages the same control commands humans use (collective thrust and body rates). We demonstrate agile flight at speeds up to 40km/h with accelerations up to 2g. This is achieved by training vision-based policies with reinforcement learning (RL). The training is facilitated using an asymmetric actor-critic with access to privileged information. To overcome the computational complexity during image-based RL training, we use the inner edges of the gates as a sensor abstraction. This simple yet robust, task-relevant representation can be simulated during training without rendering images. During deployment, a Swin-transformer-based gate detector is used. Our approach enables autonomous agile flight with standard, off-the-shelf hardware. Although our demonstration focuses on drone racing, we believe that our method has an impact beyond drone racing and can serve as a foundation for future research into real-world applications in structured environments.

6/19/2024

Back to Newton's Laws: Learning Vision-based Agile Flight via Differentiable Physics

Yuang Zhang, Yu Hu, Yunlong Song, Danping Zou, Weiyao Lin

Swarm navigation in cluttered environments is a grand challenge in robotics. This work combines deep learning with first-principle physics through differentiable simulation to enable autonomous navigation of multiple aerial robots through complex environments at high speed. Our approach optimizes a neural network control policy directly by backpropagating loss gradients through the robot simulation using a simple point-mass physics model and a depth rendering engine. Despite this simplicity, our method excels in challenging tasks for both multi-agent and single-agent applications with zero-shot sim-to-real transfer. In multi-agent scenarios, our system demonstrates self-organized behavior, enabling autonomous coordination without communication or centralized planning - an achievement not seen in existing traditional or learning-based methods. In single-agent scenarios, our system achieves a 90% success rate in navigating through complex environments, significantly surpassing the 60% success rate of the previous state-of-the-art approach. Our system can operate without state estimation and adapt to dynamic obstacles. In real-world forest environments, it navigates at speeds up to 20 m/s, doubling the speed of previous imitation learning-based solutions. Notably, all these capabilities are deployed on a budget-friendly $21 computer, costing less than 5% of a GPU-equipped board used in existing systems. Video demonstrations are available at https://youtu.be/LKg9hJqc2cc.

7/17/2024

Learning to Fly in Seconds

Jonas Eschmann, Dario Albani, Giuseppe Loianno

Learning-based methods, particularly Reinforcement Learning (RL), hold great promise for streamlining deployment, enhancing performance, and achieving generalization in the control of autonomous multirotor aerial vehicles. Deep RL has been able to control complex systems with impressive fidelity and agility in simulation but the simulation-to-reality transfer often brings a hard-to-bridge reality gap. Moreover, RL is commonly plagued by prohibitively long training times. In this work, we propose a novel asymmetric actor-critic-based architecture coupled with a highly reliable RL-based training paradigm for end-to-end quadrotor control. We show how curriculum learning and a highly optimized simulator enhance sample complexity and lead to fast training times. To precisely discuss the challenges related to low-level/end-to-end multirotor control, we also introduce a taxonomy that classifies the existing levels of control abstractions as well as non-linearities and domain parameters. Our framework enables Simulation-to-Reality (Sim2Real) transfer for direct RPM control after only 18 seconds of training on a consumer-grade laptop as well as its deployment on microcontrollers to control a multirotor under real-time guarantees. Finally, our solution exhibits competitive performance in trajectory tracking, as demonstrated through various experimental comparisons with existing state-of-the-art control solutions using a real Crazyflie nano quadrotor. We open source the code including a very fast multirotor dynamics simulator that can simulate about 5 months of flight per second on a laptop GPU. The fast training times and deployment to a cheap, off-the-shelf quadrotor lower the barriers to entry and help democratize the research and development of these systems.

4/10/2024

Whole-Body Control Through Narrow Gaps From Pixels To Action

Tianyue Wu, Yeke Chen, Tianyang Chen, Guangyu Zhao, Fei Gao

Flying through body-size narrow gaps in the environment is one of the most challenging moments for an underactuated multirotor. We explore a purely data-driven method to master this flight skill in simulation, where a neural network directly maps pixels and proprioception to continuous low-level control commands. This learned policy enables whole-body control through gaps with different geometries demanding sharp attitude changes (e.g., near-vertical roll angle). The policy is achieved by successive model-free reinforcement learning (RL) and online observation space distillation. The RL policy receives (virtual) point clouds of the gaps' edges for scalable simulation and is then distilled into the high-dimensional pixel space. However, this flight skill is fundamentally expensive to learn by exploring due to restricted feasible solution space. We propose to reset the agent as states on the trajectories by a model-based trajectory optimizer to alleviate this problem. The presented training pipeline is compared with baseline methods, and ablation studies are conducted to identify the key ingredients of our method. The immediate next step is to scale up the variation of gap sizes and geometries in anticipation of emergent policies and demonstrate the sim-to-real transformation.

9/4/2024