Given the super-rich semantics in 3D scenes, a crucial aspect of this task is achieving open-vocabulary segmentation that can handle regions and objects of various semantics including those with long-tail distributions.
Manyfew-shot segmentation (FSS) methods use cross attention to fuse support foreground (FG) into query features, regardless of the quadratic complexity.