An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Hugo Malard, Michel Olvera, Stéphane Lathuiliere

Neural Information Processing Systems 

Multimodal large language models have fueled progress in image captioning.