Learning Representations from Audio-Visual Spatial Alignment
Pedro Morgado, Yi Li
Neural Information Processing Systems
While these approaches learn high-quality representations for downstream tasks such as action recognition, their training objectives disregard spatial cues naturally occurring in audio and visual signals.
Oct-2-2025, 15:04:22 GMT