Cross-Modal Knowledge Distillation for Speech Large Language Models
Enzhi Wang, Qicheng Li, Zhiyuan Tang, Yuhang Jia
arXiv.org Artificial Intelligence
ABSTRACT

In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and that performance decreases further with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM.

Index Terms-- Speech LLMs, Cross-Modal Knowledge Distillation, Catastrophic Forgetting, Modality Inequivalence, Question Answering

1. INTRODUCTION

In recent years, large language models (LLMs) have made remarkable progress in multimodal capabilities, with voice interaction emerging as a key application direction. Cutting-edge models such as GPT-4o [1] already enable real-time spoken dialogue, providing users with more natural, flexible, and high-quality interaction experiences than traditional text-based systems. Building on this trend, many researchers have begun extending pretrained text LLMs into the speech domain, constructing large speech models with both speech understanding and generation abilities.
Sep-19-2025