Self-Supervised Learning by Cross-Modal Audio-Video Clustering