Audio-visual fine-tuning of audio-only ASR models
