Supplementary Material: Learning Representations from Audio-Visual Spatial Alignment