Scene Optimized Understanding via Synthesized Visual Inertial Data from Experts
We propose a new simulator, training approach, and policy architecture, collectively called SOUS VIDE, for end-to-end visual drone navigation. Our trained policies exhibit zero-shot sim-to-real transfer with robust real-world performance using only on-board perception and computation. Our simulator, called FiGS, couples a computationally simple drone dynamics model with a high visual fidelity Gaussian Splatting scene reconstruction. FiGS can quickly simulate drone flights producing photo-realistic images at over 100 fps. We use FiGS to collect 100k-300k observation-action pairs from an expert MPC with privileged state and dynamics information, randomized over dynamics parameters and spatial disturbances. We then distill this expert MPC into an end-to-end visuomotor policy with a lightweight neural architecture, called SV-Net. SV-Net processes color image and IMU data streams into low-level body rate and thrust commands at 20 Hz onboard a drone. Crucially, SV-Net includes a Rapid Motor Adaptation (RMA) module that adapts at runtime to variations in the dynamics parameters of the drone. In extensive hardware experiments, we show SOUS VIDE policies to be robust to ±30% mass and thrust variations, 40 m/s wind gusts, 60% changes in ambient brightness, shifting or removing objects from the scene, and people moving aggressively through the drone's visual field. The project page and code can be found below.
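To make the policy structure concrete, the sketch below shows how an SV-Net-style architecture could be wired up in PyTorch: image and IMU features are concatenated with a latent extrinsics vector produced by an RMA-style adaptation module from a short history of IMU and action data, and a small head outputs body rates and thrust. The layer sizes, history length, and module names here are illustrative assumptions, not the published SV-Net implementation.

# Minimal PyTorch sketch of an SV-Net-style policy: RGB + IMU in, body rates + thrust out,
# with an RMA-style adaptation module that infers a latent "extrinsics" vector from a
# short history of IMU/action data. Layer sizes, history length, and names are
# illustrative assumptions, not the actual SV-Net architecture.
import torch
import torch.nn as nn

class AdaptationModule(nn.Module):
    """Maps a history of (IMU, action) pairs to a latent estimate of the drone's dynamics."""
    def __init__(self, history_len=50, imu_dim=6, act_dim=4, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(history_len * (imu_dim + act_dim), 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, history):           # history: (B, history_len, imu_dim + act_dim)
        return self.net(history)          # (B, latent_dim)

class SVNetSketch(nn.Module):
    """Illustrative visuomotor policy: image + IMU + latent extrinsics -> body rates + thrust."""
    def __init__(self, imu_dim=6, latent_dim=8):
        super().__init__()
        self.image_encoder = nn.Sequential(      # small CNN stand-in for the vision backbone
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.policy_head = nn.Sequential(
            nn.Linear(32 + imu_dim + latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 4),                   # [wx, wy, wz, thrust]
        )

    def forward(self, image, imu, extrinsics):
        feat = self.image_encoder(image)         # (B, 32)
        return self.policy_head(torch.cat([feat, imu, extrinsics], dim=-1))

# Example of one inference step at the 20 Hz control rate, with dummy tensors.
policy, adapt = SVNetSketch(), AdaptationModule()
image = torch.rand(1, 3, 160, 160)               # RGB frame
imu = torch.rand(1, 6)                           # gyro + accel
history = torch.rand(1, 50, 10)                  # recent IMU readings + commanded actions
command = policy(image, imu, adapt(history))     # (1, 4) body rates + collective thrust

In the RMA recipe this follows, only the adaptation module looks at the recent history, so the latent dynamics estimate is refreshed online at the control rate without any gradient updates on the drone.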
The codebase accompanying this research is available in this repository. To replicate our experiments, follow these steps:
git clone https://github.com/username/repository-name.git
cd repository-name
pip install -r requirements.txt
python run_experiments.py
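The training data pipeline described in the abstract, rolling out a privileged expert MPC in FiGS over randomized dynamics parameters and spatial disturbances while logging observation-action pairs, can be sketched as follows. The function names (render_figs, expert_mpc, step_dynamics) and the randomization ranges are placeholders for illustration, not the actual interfaces exposed by this repository.

# Hypothetical sketch of the data-collection loop: roll out an expert MPC with privileged
# state in a Gaussian Splatting simulator, randomizing dynamics and spatial disturbances,
# and log observation-action pairs. The callables passed in are placeholders, not the
# real FiGS/MPC API.
import numpy as np

def collect_rollout(render_figs, expert_mpc, step_dynamics, init_state, horizon=400, dt=0.05):
    """Run one randomized episode and return (image, imu, action) tuples at 20 Hz."""
    # Domain randomization: perturb mass/thrust coefficients and add a spatial offset.
    dynamics = {
        "mass_scale": np.random.uniform(0.7, 1.3),      # +/-30% mass variation
        "thrust_scale": np.random.uniform(0.7, 1.3),    # +/-30% thrust variation
        "position_offset": np.random.uniform(-0.2, 0.2, size=3),
    }
    state, data = init_state, []
    for _ in range(horizon):
        image = render_figs(state)                      # photo-realistic RGB from the splat
        imu = state["imu"]                              # simulated gyro + accelerometer
        action = expert_mpc(state, dynamics)            # expert with privileged information
        data.append((image, imu, action))
        state = step_dynamics(state, action, dynamics, dt)
    return data

# Repeating collect_rollout over many randomized episodes yields the 100k-300k
# observation-action pairs used to distill the expert MPC into the SV-Net policy.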
This work introduces SOUS VIDE, a novel training paradigm leveraging Gaussian Splatting and lightweight visuomotor policy architectures for end-to-end drone navigation. By coupling high-fidelity visual data synthesis with online adaptation mechanisms, SOUS VIDE achieves zero-shot sim-to-real transfer, demonstrating remarkable robustness to variations in mass, thrust, lighting, and dynamic scene changes. Our experiments underscore the policy's ability to generalize across diverse scenarios, including complex and extended trajectories, with graceful degradation under extreme conditions. Notably, the integration of a streamlined adaptation module enabled the policy to overcome limitations of prior visuomotor approaches, offering a computationally efficient yet effective solution for addressing model inaccuracies. These findings highlight the potential of SOUS VIDE as a foundation for future advancements in autonomous drone navigation.

While its robustness and versatility are evident, challenges such as inconsistent performance in multi-objective tasks suggest opportunities for improvement through more sophisticated objective encodings. Further exploration into scaling the approach to more complex environments and incorporating additional sensory modalities could enhance both adaptability and reliability. Ultimately, this work paves the way for deploying learned visuomotor policies in real-world applications, bridging the gap between simulation and practical autonomy in drone operations.
This research is detailed in our paper titled "[Paper Title]", published at [Conference/Journal Name].
This work was supported in part by DARPA grant HR001120C0107, ONR grant N00014-23-1-2354, and Lincoln Labs grant 7000603941. The second author was supported by an NDSEG fellowship. Toyota Research Institute provided funds to support this work.
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.