Transferable speech-to-text large language model alignment module