The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders