Each 3D segment is associated with a CLIP descriptor, selected from the descriptors of its 2D observations.
To generate a descriptor for each 2D segment, we compute three CLIP descriptors per 2D mask: one on the full image, one on the segment with the background removed, and one on the minimum bounding box containing the full 2D mask (see the sketch below).
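The following is a minimal sketch of this three-crop extraction, assuming an OpenCLIP image encoder; the function names, the choice of checkpoint, and the black background fill are illustrative assumptions, not details from the paper.

```python
import numpy as np
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

def clip_embed(image: Image.Image) -> torch.Tensor:
    """Encode one PIL image into a normalized CLIP descriptor."""
    with torch.no_grad():
        feat = model.encode_image(preprocess(image).unsqueeze(0))
    return torch.nn.functional.normalize(feat, dim=-1).squeeze(0)

def three_crop_descriptors(image: Image.Image, mask: np.ndarray) -> torch.Tensor:
    """Return the three per-mask CLIP descriptors stacked as (3, D):
    full image, background-removed segment, and tight bounding box."""
    rgb = np.asarray(image)

    # 1) Full image.
    d_full = clip_embed(image)

    # 2) Segment with the background removed (background set to black
    #    here; the fill value is an assumption).
    seg = rgb * mask[..., None]
    d_seg = clip_embed(Image.fromarray(seg.astype(np.uint8)))

    # 3) Minimum bounding box containing the full 2D mask.
    ys, xs = np.nonzero(mask)
    crop = image.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
    d_box = clip_embed(crop)

    return torch.stack([d_full, d_seg, d_box])
```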
A pre-trained neural network then predicts a weight for each dimension of each descriptor, and the three descriptors are merged into a single descriptor via a per-dimension weighted average.
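Below is a hedged sketch of this per-dimension weighted merge. The architecture of the weight predictor is a stand-in assumption (the text only states that a pre-trained network predicts one weight per dimension of each descriptor); the softmax normalization across the three crops is likewise an assumption chosen so the weights form a proper average.

```python
import torch
import torch.nn as nn

class DescriptorFusion(nn.Module):
    """Predict a weight per dimension for each of the three descriptors,
    then merge them with a per-dimension weighted average."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Illustrative weight predictor: maps the concatenated descriptors
        # to one weight per dimension per descriptor.
        self.weight_net = nn.Sequential(
            nn.Linear(3 * dim, 3 * dim),
            nn.ReLU(),
            nn.Linear(3 * dim, 3 * dim),
        )

    def forward(self, descriptors: torch.Tensor) -> torch.Tensor:
        # descriptors: (3, D) -- full image, masked segment, bounding box.
        flat = descriptors.reshape(1, -1)         # (1, 3D)
        w = self.weight_net(flat).reshape(3, -1)  # (3, D): one weight per
                                                  # dimension per descriptor
        w = torch.softmax(w, dim=0)               # weights sum to 1 across
                                                  # the three crops
        return (w * descriptors).sum(dim=0)       # (D,) weighted average
```

Normalizing the weights per dimension rather than per descriptor lets the fused descriptor draw, say, its object-centric dimensions from the masked crop and its context dimensions from the full image.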