Coupling Speech Encoders with Downstream Text Models