Self-Supervised Learning by Cross-Modal Audio-Video Clustering: Supplementary Material
Neural Information Processing Systems
In this section, we give the details of the full optimization cycle and discuss the differences between the single-modality baseline and our multi-modal models. As discussed in [1], single-modality deep clustering (SDC) may converge to trivial solutions: empty clusters, or encoder parameterizations for which the classifier predicts the same label regardless of the input. DeepCluster proposes workarounds for these issues, namely reassigning empty cluster centers and sampling training images uniformly over the cluster assignments. While these strategies mitigate the issues, they do not address the root cause: SDC learns a discriminative classifier on the same input from which it derives the labels. In contrast, our multi-modal deep clustering models are less prone to trivial solutions because they learn the discriminative classifier on one modality and obtain the labels from a different modality.
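The ideas in this paragraph can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`kmeans_pseudo_labels`, `uniform_cluster_sampling_weights`), the feature dimensions, and the random arrays standing in for encoder outputs are all assumptions made for illustration. The sketch clusters features from one modality (audio), re-seeds empty clusters in the spirit of the DeepCluster workaround (the donor is chosen as the largest cluster, which is our assumption), and then samples a batch from the other modality (video) uniformly over cluster assignments, so that the pseudo-labels supervising the video classifier come from a different input.

```python
import numpy as np


def kmeans_pseudo_labels(features, k, n_iter=10, seed=0):
    """Plain k-means returning one pseudo-label per sample.

    Empty clusters are re-seeded from the largest cluster's centroid plus a
    small perturbation (an assumed variant of the DeepCluster workaround).
    """
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        counts = np.bincount(labels, minlength=k)
        for c in range(k):
            if counts[c] == 0:
                donor = counts.argmax()
                noise = 1e-3 * rng.standard_normal(features.shape[1])
                centers[c] = centers[donor] + noise
            else:
                centers[c] = features[labels == c].mean(axis=0)
    # Final assignment against the updated centers.
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def uniform_cluster_sampling_weights(labels, k):
    """Per-sample probabilities giving every non-empty cluster equal mass."""
    counts = np.bincount(labels, minlength=k)
    weights = 1.0 / counts[labels]  # counts[labels] is always >= 1
    return weights / weights.sum()


# Random arrays stand in for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((200, 16))
video_feats = rng.standard_normal((200, 32))

# Labels come from the audio modality...
pseudo_labels = kmeans_pseudo_labels(audio_feats, k=8)
probs = uniform_cluster_sampling_weights(pseudo_labels, k=8)

# ...and supervise a classifier on the video modality.
batch_idx = rng.choice(len(video_feats), size=32, p=probs)
video_batch = video_feats[batch_idx]
targets = pseudo_labels[batch_idx]
```

In a single-modality setup, `video_batch` would be drawn from the same features that produced `pseudo_labels`, which is the coupling the paragraph identifies as the root cause of trivial solutions.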
May-21-2025, 12:45:31 GMT