Closing the Gap Between Text and Speech Understanding in LLMs

Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh

Oct-16-2025, arXiv.org Artificial Intelligence

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap rely either on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.

Figure 1: SALAD reduces the text-speech understanding gap while requiring over an order of magnitude less training data than competing speech-adapted LLMs.

Work done during an internship at Apple.

Large language models (LLMs) have demonstrated impressive capabilities in general knowledge and reasoning, often surpassing specialized systems across a wide range of tasks.
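The two quantities named in the abstract, the text-speech understanding gap and a cross-modal distillation objective, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the accuracy and logit values are made up, and the forward KL divergence between a text teacher and a speech student is one common way to write a distillation loss, not necessarily the objective SALAD uses.

```python
import math


def understanding_gap(text_accuracy, speech_accuracy):
    """Accuracy drop when the same benchmark is presented as speech.

    text_accuracy: score of the original text LLM on text inputs.
    speech_accuracy: score of the speech-adapted LLM on spoken inputs.
    """
    return text_accuracy - speech_accuracy


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits):
    """Forward KL(teacher || student) for one next-token distribution.

    Assumption: the teacher is the text LLM reading the transcript and
    the student is the speech-adapted LLM hearing the same utterance.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical numbers, for illustration only.
gap = understanding_gap(text_accuracy=0.70, speech_accuracy=0.55)
loss = distillation_loss([2.0, 0.5, -1.0], [1.0, 1.0, -0.5])
print(f"gap = {gap:.2f}, distillation loss = {loss:.4f}")
```

In this sketch the loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is the sense in which distillation pulls the speech model's behavior toward that of the text model.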

artificial intelligence, large language model, natural language

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Speech > Speech Recognition (1.00)