Closing the Gap Between Text and Speech Understanding in LLMs
Cuervo, Santiago, Seto, Skyler, de Seyssel, Maureen, Bai, Richard He, Gu, Zijin, Likhomanenko, Tatiana, Jaitly, Navdeep, Aldeneh, Zakaria
arXiv.org Artificial Intelligence
Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap.

In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.

Figure 1: SALAD reduces the text-speech understanding gap while requiring over an order of magnitude less training data than competing speech-adapted LLMs.

Work done during an internship at Apple.

Large language models (LLMs) have demonstrated impressive capabilities in general knowledge and reasoning, often surpassing specialized systems across a wide range of tasks.
Oct-16-2025
- Country:
  - Asia
    - Middle East > Jordan (0.04)
    - Singapore (0.04)
  - Europe
    - Austria > Vienna (0.14)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Greece (0.04)
  - North America
    - Mexico > Mexico City
      - Mexico City (0.04)
    - United States
      - Florida > Miami-Dade County
        - Miami (0.04)
      - New Mexico > Bernalillo County
        - Albuquerque (0.04)
  - South America > Chile
- Genre:
  - Research Report > New Finding (0.46)
- Technology: