Review for NeurIPS paper: Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Jan-25-2025, 14:22:15 GMT – Neural Information Processing Systems

Weaknesses: - Despite the extensive empirical evaluation, the three multimodal variants proposed by the paper are direct extensions of the DeepCluster algorithm [4]. The main contributions appear to be (1) a working pipeline demonstrating that variants of DeepCluster work with video and audio encoders; and (2) scaling up the training to extremely large datasets. While both contributions are interesting, they appear to me to be less relevant to the NeurIPS audience. It would also be great if such conjectures were accompanied by empirical evaluations on more diverse tasks than the three classification datasets. That would help the audience understand when to apply the XDC variant of DeepCluster (e.g., is it specific to audio and visual signals in videos, or is it more general?).
