Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Neural Information Processing Systems 

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition that combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. In linear probing for 3D scene perception, it outperforms standalone state-of-the-art (SOTA) 2D and 3D self-supervised models by 14.2\% and 4.8\%, respectively, and also outperforms the concatenation of their features.
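The abstract names two training signals, 3D intra-modal self-distillation and 2D-3D cross-modal joint embedding, without stating their exact form. As a rough illustration only, the sketch below assumes a teacher-student cross-entropy for the intra-modal term and a cosine-distance term between a 3D-derived prediction and an image feature for the cross-modal term. The function names, temperatures, and the `cross_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distillation_loss(student_logits, teacher_logits,
                           student_temp=0.1, teacher_temp=0.05):
    """Assumed intra-modal term: cross-entropy from teacher to student.

    Both inputs are per-point logits from two views of the same 3D scene.
    The temperatures are illustrative values, not the paper's settings.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    p_student = softmax(student_logits, student_temp)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(p_teacher, p_student))


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def cross_modal_loss(predicted_from_3d, image_feature):
    """Assumed cross-modal term: cosine distance in a shared embedding space."""
    return 1.0 - cosine_similarity(predicted_from_3d, image_feature)


def joint_loss(student_logits, teacher_logits,
               predicted_from_3d, image_feature, cross_weight=1.0):
    """Weighted sum of the two assumed terms (hypothetical weighting)."""
    intra = self_distillation_loss(student_logits, teacher_logits)
    cross = cross_modal_loss(predicted_from_3d, image_feature)
    return intra + cross_weight * cross


if __name__ == "__main__":
    # Toy inputs: one 3D point with 3 prototype logits and a 2-D embedding.
    loss = joint_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0], [0.8, 0.6])
    print(f"joint loss on toy inputs: {loss:.4f}")
```

Under these assumptions the joint objective falls when the student agrees with the teacher on the 3D views and when the 3D-derived prediction aligns with the corresponding image feature, which is the intuition the abstract describes; the actual loss functions and their weighting may differ.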