Aligning Pre-trained Models for Spoken Language Translation