Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Jun-12-2026, 12:35:59 GMT–Neural Information Processing Systems

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception.
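As a rough illustration of how the two training signals named in the abstract could be combined, the sketch below adds a 3D self-distillation term to a 2D-3D feature-alignment term. It is not the paper's implementation: the function names, the temperatures, the cosine-distance alignment, and the `weight` parameter are all assumptions made for this example, and the inputs are taken to be already-paired per-point arrays.

```python
import numpy as np


def softmax(x, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.05):
    """Cross-entropy between teacher and student distributions.

    Assumption: both networks see different views of the same 3D points,
    and the teacher output is treated as a fixed target.
    """
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


def cross_modal_loss(point_feats, pixel_feats):
    """Mean cosine distance between paired 3D point and 2D pixel features.

    Assumption: row i of each array describes the same scene location.
    """
    a = point_feats / np.maximum(
        np.linalg.norm(point_feats, axis=-1, keepdims=True), 1e-12)
    b = pixel_feats / np.maximum(
        np.linalg.norm(pixel_feats, axis=-1, keepdims=True), 1e-12)
    return float((1.0 - (a * b).sum(axis=-1)).mean())


def joint_loss(student_logits, teacher_logits, point_feats, pixel_feats,
               weight=1.0):
    """Sum of the intra-modal and cross-modal terms (hypothetical weighting)."""
    return (self_distillation_loss(student_logits, teacher_logits)
            + weight * cross_modal_loss(point_feats, pixel_feats))
```

In this sketch the alignment term is zero when paired features point in the same direction, so a lower joint loss means the 3D features are both self-consistent across views and close to their 2D counterparts.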



Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.58)
  - Representation & Reasoning > Spatial Reasoning (0.44)