Distilling Audio-Visual Knowledge by Compositional Contrastive Learning