VISTA: Open-Vocabulary, Task-Relevant Robot Exploration with Online Semantic Gaussian Splatting

Abstract

We present VISTA (Viewpoint-based Image selection with Semantic Task Awareness), an active exploration method for robots to plan informative trajectories that improve 3D map quality in areas most relevant for task completion. Given an open-vocabulary search instruction (e.g., "find a person"), VISTA enables a robot to explore its environment to search for the object of interest, while simultaneously building a real-time semantic 3D Gaussian Splatting reconstruction of the scene. The robot navigates its environment by planning receding-horizon trajectories that prioritize semantic similarity to the query and exploration of unseen regions of the environment. To evaluate trajectories, VISTA introduces a novel, efficient viewpoint-semantic coverage metric that quantifies both the geometric view diversity and task relevance in the 3D scene. On static datasets, our coverage metric outperforms state-of-the-art baselines, FisherRF and Bayes' Rays, in computation speed and reconstruction quality. In quadrotor hardware experiments, VISTA achieves 6x higher success rates in challenging maps, compared to baseline methods, while matching baseline performance in less challenging maps. Lastly, we show that VISTA is platform-agnostic by deploying it on a quadrotor drone and a Spot quadruped robot.

System Diagram

Our pipeline can be divided into three main components: 1) Semantic Mapping, 2) Trajectory Proposal, and 3) Trajectory Scoring.

Semantic SplatBridge

Our Semantic Gaussian Splatting pipeline trains the environment map in real time while the quadrotor is flying.

VISTA-Map

Information from the Gaussian Splat is transferred to a voxel grid representation to allow for fast planning and information gain computation.
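
As a rough illustration of this transfer step, the sketch below bins Gaussian centers into a dense voxel grid with NumPy. It is not the VISTA implementation: the function name `splat_to_voxels`, the choice to store a per-voxel Gaussian count and the maximum semantic similarity, and the assumption that similarity scores lie in [0, 1] are all hypothetical.

```python
import numpy as np

def splat_to_voxels(means, similarity, grid_min, voxel_size, grid_shape):
    """Bin Gaussian centers into a dense voxel grid (hypothetical helper).

    means:      (N, 3) Gaussian centers in world coordinates
    similarity: (N,) semantic similarity to the query, assumed in [0, 1]
    Returns a per-voxel Gaussian count and a per-voxel max similarity.
    """
    means = np.asarray(means, dtype=float)
    similarity = np.asarray(similarity, dtype=float)

    # Integer voxel index of every Gaussian center.
    idx = np.floor((means - np.asarray(grid_min)) / voxel_size).astype(int)

    # Keep only Gaussians that fall inside the grid bounds.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, similarity = idx[inside], similarity[inside]

    counts = np.zeros(grid_shape, dtype=int)
    semantic = np.zeros(grid_shape, dtype=float)

    # Unbuffered updates so several Gaussians in one voxel are all applied.
    np.add.at(counts, tuple(idx.T), 1)
    np.maximum.at(semantic, tuple(idx.T), similarity)
    return counts, semantic
```

Voxels with a zero count have not been observed by any Gaussian, so a planner could treat the boundary between zero and nonzero voxels as candidate frontier under these assumptions.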

VISTA-Score

Leveraging the voxel grid representation, we compute a voxel-centric information gain metric, where each voxel holds information about the directions it has been viewed from so far. The information gain of a new viewpoint can be determined from the coverage the new rays would provide.
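
One way to picture a voxel-centric coverage score is sketched below. It is a simplified stand-in, not the metric from the paper: the azimuth/elevation binning (`N_AZ`, `N_EL`), the range-only visibility test (no occlusion or field-of-view check), and the names `direction_bin`, `view_gain`, and `commit_view` are assumptions made for illustration.

```python
import numpy as np

N_AZ, N_EL = 8, 4  # assumed angular resolution of each voxel's view record

def direction_bin(dirs):
    """Map unit vectors (N, 3) to a flat azimuth/elevation bin index."""
    az = np.arctan2(dirs[:, 1], dirs[:, 0])           # in [-pi, pi]
    el = np.arcsin(np.clip(dirs[:, 2], -1.0, 1.0))    # in [-pi/2, pi/2]
    az_i = np.minimum(((az + np.pi) / (2 * np.pi) * N_AZ).astype(int), N_AZ - 1)
    el_i = np.minimum(((el + np.pi / 2) / np.pi * N_EL).astype(int), N_EL - 1)
    return az_i * N_EL + el_i

def view_gain(seen, centers, weights, cam_pos, max_range):
    """Weighted count of (voxel, direction-bin) pairs a camera would newly cover.

    seen:    (V, N_AZ * N_EL) bool, directions each voxel was already viewed from
    centers: (V, 3) voxel centers
    weights: (V,) task relevance per voxel (e.g. semantic similarity)
    """
    offsets = np.asarray(cam_pos, dtype=float) - centers   # voxel -> camera
    dist = np.linalg.norm(offsets, axis=1)
    vox = np.nonzero((dist > 1e-9) & (dist <= max_range))[0]
    bins = direction_bin(offsets[vox] / dist[vox, None])
    new = ~seen[vox, bins]                                  # not yet covered
    return float(np.sum(weights[vox][new])), vox[new], bins[new]

def commit_view(seen, vox, bins):
    """Record that the given voxels have now been seen from the given bins."""
    seen[vox, bins] = True
```

In this sketch, re-visiting the same camera position after `commit_view` yields zero gain, while viewing the same voxels from a different direction is rewarded again, which is the behavior a view-diversity term is meant to capture.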

VISTA-Plan

A Gaussian Mixture Model (GMM) biases trajectories toward frontier voxels and high semantic similarity points.
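
The sketch below shows one plausible way to draw biased goal positions from such a mixture. The construction is assumed, not taken from the paper: it places one isotropic Gaussian component at each frontier point and each semantically relevant point, splits the mixture weight between the two groups with a hypothetical `frontier_weight`, and weights semantic components by their similarity scores.

```python
import numpy as np

def sample_goals(frontier_pts, semantic_pts, semantic_scores, n_samples,
                 sigma=0.5, frontier_weight=0.5, seed=0):
    """Sample 3D goal positions from an isotropic GMM (hypothetical helper).

    frontier_pts:    (F, 3) centers of frontier voxels
    semantic_pts:    (S, 3) points with high similarity to the query
    semantic_scores: (S,) positive similarity scores for semantic_pts
    """
    rng = np.random.default_rng(seed)
    frontier_pts = np.asarray(frontier_pts, dtype=float)
    semantic_pts = np.asarray(semantic_pts, dtype=float)
    scores = np.asarray(semantic_scores, dtype=float)

    # One mixture component per candidate point.
    means = np.vstack([frontier_pts, semantic_pts])

    # Frontier components share their weight uniformly; semantic components
    # share the remainder in proportion to their similarity score.
    w_frontier = np.full(len(frontier_pts), frontier_weight / len(frontier_pts))
    w_semantic = (1.0 - frontier_weight) * scores / scores.sum()
    weights = np.concatenate([w_frontier, w_semantic])

    # Pick a component for each sample, then perturb around its mean.
    comp = rng.choice(len(means), size=n_samples, p=weights)
    samples = means[comp] + sigma * rng.standard_normal((n_samples, 3))
    return samples, weights
```

Sampled goals could then serve as endpoints for candidate trajectories, each of which is scored before the robot commits to the next receding-horizon segment.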

BibTeX

@article{nagami2025vista,
    title={VISTA: Open-Vocabulary, Task-Relevant Robot Exploration with Online Semantic Gaussian Splatting}, 
    author={Keiko Nagami and Timothy Chen and Javier Yu and Ola Shorinwa and Maximilian Adang and Carlyn Dougherty and Eric Cristofalo and Mac Schwager},
    journal={arXiv preprint arXiv:2507.01125},
    year={2025},
}