Each 3D segment is associated with a CLIP descriptor, selected from the descriptors of its 2D observations.
To generate descriptors for 2D segments, we combine three CLIP descriptors for each 2D mask: one for the full image, one for the segment with the background removed, and one for the minimum bounding box containing the full 2D mask.
Then, a pre-trained neural network predicts a weight for each dimension of each descriptor, and the three descriptors are merged via a per-dimension weighted average.
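A minimal sketch of this per-dimension weighted fusion is shown below. The module name, layer sizes, and softmax normalization are assumptions for illustration; only the idea of predicting one weight per dimension of each of the three CLIP descriptors and averaging them accordingly comes from the text.

```python
import torch
import torch.nn as nn


class DescriptorFusion(nn.Module):
    """Hypothetical network that fuses three CLIP descriptors per 2D mask."""

    def __init__(self, clip_dim: int = 512):
        super().__init__()
        # Takes the three concatenated descriptors and predicts one weight
        # per dimension for each of them (assumed architecture).
        self.weight_head = nn.Sequential(
            nn.Linear(3 * clip_dim, clip_dim),
            nn.ReLU(),
            nn.Linear(clip_dim, 3 * clip_dim),
        )

    def forward(self, full_img, masked_seg, bbox_crop):
        # full_img, masked_seg, bbox_crop: (B, clip_dim) CLIP descriptors for the
        # full image, the background-removed segment, and the bounding-box crop.
        stacked = torch.stack([full_img, masked_seg, bbox_crop], dim=1)  # (B, 3, D)
        logits = self.weight_head(stacked.flatten(1)).view_as(stacked)   # (B, 3, D)
        weights = torch.softmax(logits, dim=1)  # normalize over the 3 sources, per dimension
        fused = (weights * stacked).sum(dim=1)  # per-dimension weighted average, (B, D)
        return nn.functional.normalize(fused, dim=-1)
```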
After pre-training the neural network on ScanNet++, we validate its performance with zero-shot evaluation on Replica and ScanNet200, outperforming previous approaches.
We also showcase its ability to retain language-image properties by evaluating generic phrases as queries rather than only class names.
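As a hedged illustration of such open-vocabulary querying, the sketch below ranks 3D segments against a free-form phrase by cosine similarity between their fused descriptors and a CLIP text embedding. The use of open_clip and the specific model/checkpoint names are assumptions, not part of the described method.

```python
import torch
import open_clip

# Any CLIP text encoder compatible with the image descriptors would work here.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="laion2b_s34b_b88k")
tokenizer = open_clip.get_tokenizer("ViT-B-16")


def rank_segments(segment_descriptors: torch.Tensor, query: str) -> torch.Tensor:
    # segment_descriptors: (N, D) L2-normalized descriptors of the 3D segments.
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([query]))
        text_feat = torch.nn.functional.normalize(text_feat, dim=-1)
    # Cosine similarity ranks segments by how well they match the phrase.
    return (segment_descriptors @ text_feat.T).squeeze(-1)


# e.g. a generic phrase rather than a class label:
# scores = rank_segments(descriptors, "something soft to sit on")
```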