Vision Transformers are Parameter-Efficient Audio-Visual Learners