Review for NeurIPS paper: Learning Representations from Audio-Visual Spatial Alignment