Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer