Cross-Modal Knowledge Distillation for Speech Large Language Models

Wang, Enzhi; Li, Qicheng; Tang, Zhiyuan; Jia, Yuhang

arXiv.org Artificial Intelligence 

ABSTRACT

In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and that performance decreases further with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM.

Index Terms-- Speech LLMs, Cross-Modal Knowledge Distillation, Catastrophic Forgetting, Modality Inequivalence, Question Answering

1. INTRODUCTION

In recent years, large language models (LLMs) have made remarkable progress in multimodal capabilities, with voice interaction emerging as a key application direction. Cutting-edge models such as GPT-4o [1] already enable real-time spoken dialogue, providing users with more natural, flexible, and high-quality interaction experiences compared to traditional text-based systems. Building on this trend, many researchers have begun extending pretrained text LLMs into the speech domain, constructing large speech models with both speech understanding and generation abilities.