Supplementary Material: Learning Representations from Audio-Visual Spatial Alignment