Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective