Each 3D segment is associated with a CLIP descriptor, selected from the descriptors of its 2D observations.
To generate descriptors for 2D segments, we combine three CLIP descriptors for each 2D mask: one for the full image, one for the segment with the background removed, and one for the minimum bounding box containing the full 2D mask.
Then, a pre-trained neural network predicts a weight for each dimension of each descriptor, and the three descriptors are merged via a per-dimension weighted average.
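A minimal sketch of this per-dimension weighted fusion is shown below. The module name, layer sizes, and softmax normalization are assumptions for illustration; only the idea of predicting one weight per dimension of each of the three CLIP descriptors and averaging them accordingly comes from the text.

```python
import torch
import torch.nn as nn


class DescriptorFusion(nn.Module):
    """Hypothetical network that fuses three CLIP descriptors per 2D mask."""

    def __init__(self, clip_dim: int = 512):
        super().__init__()
        # Takes the three concatenated descriptors and predicts one weight
        # per dimension for each of them (assumed architecture).
        self.weight_head = nn.Sequential(
            nn.Linear(3 * clip_dim, clip_dim),
            nn.ReLU(),
            nn.Linear(clip_dim, 3 * clip_dim),
        )

    def forward(self, full_img, masked_seg, bbox_crop):
        # full_img, masked_seg, bbox_crop: (B, clip_dim) CLIP descriptors for the
        # full image, the background-removed segment, and the bounding-box crop.
        stacked = torch.stack([full_img, masked_seg, bbox_crop], dim=1)  # (B, 3, D)
        logits = self.weight_head(stacked.flatten(1)).view_as(stacked)   # (B, 3, D)
        weights = torch.softmax(logits, dim=1)  # normalize over the 3 sources, per dimension
        fused = (weights * stacked).sum(dim=1)  # per-dimension weighted average, (B, D)
        return nn.functional.normalize(fused, dim=-1)
```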
After pre-training the neural network on ScanNet++, we validate its performance with zero-shot evaluation on Replica and ScanNet200, outperforming previous approaches.
We also showcase its ability to retain language-image properties by evaluating generic phrases as queries rather than only class names.
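As a hedged illustration of such open-vocabulary querying, the sketch below ranks 3D segments against a free-form phrase by cosine similarity between their fused descriptors and a CLIP text embedding. The use of open_clip and the specific model/checkpoint names are assumptions, not part of the described method.

```python
import torch
import open_clip

# Any CLIP text encoder compatible with the image descriptors would work here.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="laion2b_s34b_b88k")
tokenizer = open_clip.get_tokenizer("ViT-B-16")


def rank_segments(segment_descriptors: torch.Tensor, query: str) -> torch.Tensor:
    # segment_descriptors: (N, D) L2-normalized descriptors of the 3D segments.
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([query]))
        text_feat = torch.nn.functional.normalize(text_feat, dim=-1)
    # Cosine similarity ranks segments by how well they match the phrase.
    return (segment_descriptors @ text_feat.T).squeeze(-1)


# e.g. a generic phrase rather than a class label:
# scores = rank_segments(descriptors, "something soft to sit on")
```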