Open-Vocabulary Online Semantic Mapping for SLAM

1University of Zaragoza, 2University of Amsterdam

OVO builds 3D semantic representations with open-vocabulary online 3D segmentation.

Abstract

We present an Open-Vocabulary Online 3D semantic mapping pipeline for SLAM, which we denote by its acronym OVO.

Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. These vectors are computed from the viewpoints in which each segment is observed, using a novel CLIP merging method.

Notably, OVO has a significantly lower computational and memory footprint than offline baselines, while also achieving better segmentation metrics. In addition, we show experimental results of our mapping contributions integrated with two different SLAM backbones (Gaussian-SLAM and ORB-SLAM2), which are the first to demonstrate end-to-end open-vocabulary online 3D reconstruction without relying on ground-truth camera poses or scene geometry.

System

OVO pipeline

Given an input RGB-D video, a visual SLAM pipeline selects a set of keyframes and estimates their poses and a 3D point cloud representing the scene. From this 3D representation, our mapping module extracts and tracks 3D segments and assigns to each 3D segment a CLIP vector aggregated from those extracted in the keyframes. OVO outperforms state-of-the-art open-vocabulary 3D segmentation baselines, despite being the only method that can run both online and without ground-truth camera poses.

Table 2 from paper

CLIP descriptors

Each 3D segment is associated with a CLIP descriptor, selected from the descriptors of its 2D observations.

To generate descriptors for 2D segments, we combine three CLIP descriptors for each 2D mask: one for the full image, one for the segment with the background removed, and one for the minimum bounding box containing the full 2D mask.
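As a rough illustration of the three inputs described above, the following sketch builds them from an image and a boolean mask using NumPy only. The function name `mask_views` is hypothetical, and details such as filling the removed background with zeros and cropping the background-removed view to the bounding box are assumptions, not necessarily what OVO does. Each returned view would then be passed to a CLIP image encoder, which is not shown here.

```python
import numpy as np

def mask_views(image, mask):
    """Return (full image, background-removed segment, bounding-box crop).

    image: H x W x 3 array; mask: H x W boolean array for one 2D segment.
    Hypothetical helper: the zero fill and the crop of the masked view
    are assumptions made for this sketch.
    """
    if not mask.any():
        raise ValueError("mask is empty")
    # Minimum bounding box that contains every True pixel of the mask.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    r0, r1 = rows[0], rows[-1] + 1
    c0, c1 = cols[0], cols[-1] + 1
    # View 1: the full image, unchanged.
    full = image
    # View 2: the segment with background pixels set to zero (assumed fill).
    masked = image.copy()
    masked[~mask] = 0
    masked = masked[r0:r1, c0:c1]
    # View 3: the bounding-box crop, background kept.
    bbox = image[r0:r1, c0:c1]
    return full, masked, bbox
```

The second and third views differ only in whether the pixels outside the mask are kept, which is what lets the later merging step trade off segment appearance against local context.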

Then, a pre-trained neural network predicts a weight for each dimension of each descriptor, and finally the three descriptors are merged into a single one by a per-dimension weighted average.
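The merging step can be sketched as follows, under stated assumptions. The per-dimension scores are taken as a given input standing in for the output of the weight-prediction network, which is not reproduced here. The name `merge_descriptors`, the softmax used to turn scores into weights that sum to one per dimension, and the final L2 normalization are all assumptions for illustration.

```python
import numpy as np

def merge_descriptors(descriptors, scores):
    """Merge K descriptors of dimension D with per-dimension weights.

    descriptors: K x D array (here K = 3: full image, masked, bounding box).
    scores: K x D array of raw scores, standing in for the network output.
    """
    descriptors = np.asarray(descriptors, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    if descriptors.shape != scores.shape:
        raise ValueError("descriptors and scores must have the same shape")
    # Softmax over the K descriptors, independently for each dimension
    # (assumed normalization so that weights sum to one per dimension).
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Per-dimension weighted average of the descriptors.
    merged = (weights * descriptors).sum(axis=0)
    # L2-normalize so cosine similarity reduces to a dot product (assumed).
    norm = np.linalg.norm(merged)
    return merged / norm if norm > 0 else merged
```

With equal scores this reduces to a plain average of the three descriptors; a very large score for one descriptor in a given dimension makes the merged value in that dimension follow that descriptor.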

Matting example

BibTeX


      @article{martins2024ovo,
        title={Open-Vocabulary Online Semantic Mapping for SLAM},
        author={Martins, Tomas Berriel and Oswald, Martin R and Civera, Javier},
        journal={arXiv preprint arXiv:2411.15043},
        year={2024}
      }