LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal

Mar-6-2025–arXiv.org Artificial Intelligence

Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate than speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
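
The abstract does not spell out how the multi-queue token streaming system is organized, so the following is only a minimal sketch of the general idea of decoupling text generation from speech synthesis. It assumes that streamed tokens are grouped into sentences and dispatched round-robin to worker queues; `fake_tts`, `split_sentences`, `worker`, and `stream_to_speech` are hypothetical names for illustration and are not taken from the LLMVoX code base.

```python
import queue
import threading


def fake_tts(text):
    # Hypothetical stand-in for a TTS model: returns a label instead of audio.
    return f"<audio:{text}>"


def split_sentences(tokens):
    """Group streamed tokens into sentences ending at '.', '!' or '?'."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((".", "!", "?")):
            yield " ".join(buf)
            buf = []
    if buf:  # flush a trailing fragment without end punctuation
        yield " ".join(buf)


def worker(in_q, results):
    """Consume (index, sentence) items until a None sentinel arrives."""
    while True:
        item = in_q.get()
        if item is None:
            break
        idx, sentence = item
        results[idx] = fake_tts(sentence)


def stream_to_speech(tokens, n_queues=2):
    """Assumed design: sentences alternate between queues, one worker each."""
    queues = [queue.Queue() for _ in range(n_queues)]
    results = {}
    threads = [
        threading.Thread(target=worker, args=(q, results)) for q in queues
    ]
    for t in threads:
        t.start()

    count = 0
    for idx, sentence in enumerate(split_sentences(tokens)):
        queues[idx % n_queues].put((idx, sentence))
        count += 1

    for q in queues:
        q.put(None)  # signal each worker to stop
    for t in threads:
        t.join()

    # Reassemble chunks in their original order for playback.
    return [results[i] for i in range(count)]
```

Because the producer only enqueues text and never waits on synthesis, the text source can keep generating while earlier sentences are being voiced, which is the kind of decoupling the abstract describes.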


