SOUS VIDE: Cooking Visual Drone Navigation Policies in a Gaussian Splatting Vacuum

Stanford University

SOUS VIDE creates end-to-end, zero-shot visual drone navigation policies that are robust to scene changes.

Abstract

We propose a new simulator, training approach, and policy architecture, collectively called SOUS VIDE, for end-to-end visual drone navigation. Our trained policies exhibit zero-shot sim-to-real transfer with robust real-world performance using only on-board perception and computation. Our simulator, called FiGS, couples a computationally simple drone dynamics model with a high visual fidelity Gaussian Splatting scene reconstruction. FiGS can quickly simulate drone flights producing photorealistic images at up to 130 fps. We use FiGS to collect 100k-300k observation-action pairs from an expert MPC with privileged state and dynamics information, randomized over dynamics parameters and spatial disturbances. We then distill this expert MPC into an end-to-end visuomotor policy with a lightweight neural architecture, called SV-Net. SV-Net processes color image, optical flow and IMU data streams into low-level body rate and thrust commands at 20Hz onboard a drone. Crucially, SV-Net includes a Rapid Motor Adaptation (RMA) module that adapts at runtime to variations in drone dynamics. In a campaign of 105 hardware experiments, we show SOUS VIDE policies to be robust to 30% mass variations, 40 m/s wind gusts, 60% changes in ambient brightness, shifting or removing objects from the scene, and people moving aggressively through the drone’s visual field. Code, data, and videos can be found in the links above.

SOUS VIDE Pipeline.

FiGS

Flying in Gaussian Splats (FiGS) is our lightweight simulator that renders images from a Gaussian Splat along the trajectory solution of a simplified 9-dimensional drone dynamics model, producing paired visual and state data. To generate this data, users need only provide a short video recording of the scene with a single ArUco tag placed within it.
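The simulate-then-render loop can be sketched as follows. This is a minimal illustration, not the paper's exact model: we assume a 9-state vector (position, velocity, Euler angles), mass-normalized thrust plus body-rate inputs with small-angle rate integration, and a stub `render` function standing in for the Gaussian Splat renderer.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def euler_R(ang):
    """ZYX Euler angles -> world-from-body rotation matrix."""
    phi, th, psi = ang
    cph, sph = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(th), np.sin(th)
    cps, sps = np.cos(psi), np.sin(psi)
    Rz = np.array([[cps, -sps, 0], [sps, cps, 0], [0, 0, 1]])
    Ry = np.array([[cth, 0, sth], [0, 1, 0], [-sth, 0, cth]])
    Rx = np.array([[1, 0, 0], [0, cph, -sph], [0, sph, cph]])
    return Rz @ Ry @ Rx

def step(state, u, dt=0.05):
    """One Euler step of a simplified 9-state model.
    state = [position(3), velocity(3), Euler angles(3)];
    u = [mass-normalized thrust, body rates(3)] (hypothetical stand-in)."""
    p, v, ang = state[:3], state[3:6], state[6:9]
    a = euler_R(ang) @ np.array([0.0, 0.0, u[0]]) - np.array([0.0, 0.0, G])
    return np.concatenate([p + dt * v, v + dt * a, ang + dt * u[1:]])

def rollout(x0, controller, render, T=100, dt=0.05):
    """Simulate a flight, rendering a frame at each pose.
    Returns a list of (image, state, action) observation-action tuples."""
    data, x = [], x0.copy()
    for _ in range(T):
        u = controller(x)
        data.append((render(x[:3], x[6:9]), x.copy(), u.copy()))
        x = step(x, u, dt)
    return data

# Hover demo: thrust exactly cancels gravity; stub renderer in place
# of the Gaussian Splat image query.
hover = lambda x: np.array([G, 0.0, 0.0, 0.0])
stub_render = lambda p, ang: np.zeros((8, 8, 3))  # placeholder RGB frame
traj = rollout(np.zeros(9), hover, stub_render)
```

In FiGS the controller role is played by the privileged-information expert MPC, and the renderer queries the Gaussian Splat at the simulated camera pose.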

Example data generation from FiGS.

SV-Net

We train a visuomotor navigation policy using our SV-Net architecture, detailed below. Notably, the architecture incorporates a history network to perform a variant of Rapid Motor Adaptation (RMA), effectively addressing variations in drone dynamics between the real world and the simulation environment used to generate the training data.
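The RMA idea can be sketched with a toy forward pass: a history network compresses a buffer of recent IMU readings and commands into a latent vector that implicitly encodes the drone's current dynamics (mass, drag, and the like), and the policy head consumes that latent alongside visual features. All dimensions, layer sizes, and the use of random untrained weights below are illustrative assumptions, not the SV-Net specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP layers (illustrative; trained in practice)."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Apply layers with tanh activations on all but the last."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.tanh(x)
    return x

# Hypothetical dimensions, not taken from the paper.
VIS_DIM, IMU_DIM, ACT_DIM, LATENT = 64, 6, 4, 8
HIST = 20  # length of the state-action history window

vision_enc = mlp([VIS_DIM, 32])
history_enc = mlp([HIST * (IMU_DIM + ACT_DIM), 32, LATENT])  # RMA-style adaptation module
policy_head = mlp([32 + LATENT, 64, ACT_DIM])  # body rates + thrust

def act(vis_feat, history):
    """history: (HIST, IMU_DIM + ACT_DIM) rolling buffer of recent IMU
    readings and commands; its latent adapts the policy to the current
    dynamics at runtime, with no extra system identification step."""
    z = forward(history_enc, history.reshape(-1))
    ctx = np.concatenate([forward(vision_enc, vis_feat), z])
    return forward(policy_head, ctx)

u = act(rng.standard_normal(VIS_DIM),
        rng.standard_normal((HIST, IMU_DIM + ACT_DIM)))
```

Because the history latent is computed from signals available onboard, this adaptation runs at deployment time without privileged state or dynamics information.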

The SV-Net architecture.

Deployment and Training Comparison

Real-World Deployment

BibTeX

@article{low2024sousvide,
        author  = {Low, JunEn and Adang, Max and Yu, Javier and Nagami, Keiko and Schwager, Mac},
        title   = {SOUS VIDE: Cooking Visual Drone Navigation Policies in a Gaussian Splatting Vacuum},
        journal = {IEEE Robotics and Automation Letters (under review)},
        year    = {2024},
        note    = {Available on arXiv: \url{https://arxiv.org/abs/2412.16346}},
        archivePrefix = {arXiv},
        eprint    = {2412.16346},
        primaryClass = {cs.RO},
}