Learning Representations from Audio-Visual Spatial Alignment

Pedro Morgado, Yi Li