The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders
