Review for NeurIPS paper: Learning Representations from Audio-Visual Spatial Alignment
–Neural Information Processing Systems
Saying that the models completely disregard spatial information is too strong a statement, as these models can easily be repurposed to localize sound sources to some extent. I believe there is some miscommunication. I meant using the model for a downstream task that requires audio-visual spatial alignment. The authors report results on the AVSA self-supervision task and compare it to other methods such as AVC, but that is the self-supervision (pretext) task setup rather than an actual downstream task.
Jan-23-2025, 06:04:03 GMT