Vision Transformers are Parameter-Efficient Audio-Visual Learners
