Distilling an End-to-End Voice Assistant Without Instruction Training Data

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang

All authors besides first and last sorted alphabetically.

arXiv.org Artificial Intelligence 

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the responses of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.

Figure 1: Training pipeline for the Distilled Voice Assistant (DiVA). Red indicates trainable components, while blue indicates frozen pretrained modules. DiVA modifies a text-only LLM into a general-purpose Speech LLM by using the model's own responses to transcribed speech as self-supervision.

As the capabilities of Large Language Models (LLMs) increase, so does the value of bringing these capabilities to new modalities, including audio and speech (Shu et al., 2023; Wang et al., 2023; Gong et al., 2023). Speech is a natural interaction surface for language technology (Murad et al., 2019), offering measurable efficiency gains for users (Ruan et al., 2018). One straightforward method of integrating speech with LLMs is to feed audio to an Automatic Speech Recognition (ASR) model and produce a text transcription for the LLM to use.
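To make the cascaded setup concrete, the sketch below pipes an off-the-shelf ASR model into a text-only LLM using the Hugging Face pipeline API. The checkpoint names and prompt format are illustrative placeholders, not the exact systems evaluated in the paper; the point is simply that the LLM only ever sees the transcript, so prosody, emotion, and speaker information are discarded at step 1.

```python
# Hypothetical sketch of the cascaded ASR -> LLM baseline described above.
# "openai/whisper-small" and "your-text-llm" are illustrative placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="your-text-llm")

def cascaded_assistant(audio_path: str) -> str:
    # Step 1: ASR reduces the audio to text, dropping all non-lexical information.
    transcript = asr(audio_path)["text"]
    # Step 2: the text-only LLM responds to the transcript alone.
    prompt = f"User said: {transcript}\nAssistant:"
    return llm(prompt, max_new_tokens=128)[0]["generated_text"]
```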
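The abstract and Figure 1, by contrast, describe training DiVA by using the frozen text-only LLM's own responses to transcripts as self-supervision. The following is a minimal PyTorch-style sketch of one way such a distillation step could look, under the assumption that the objective matches the LLM's next-token distributions when conditioned on trainable audio embeddings to its distributions when conditioned on the transcript. The names `audio_encoder`, `llm`, and the batch fields are hypothetical; this is not the paper's released implementation.

```python
# Sketch of a self-supervised distillation step in the spirit of Figure 1.
# Assumption: the loss is a KL divergence between the frozen LLM's predictions
# over its own response tokens, conditioned on audio embeddings (student) vs.
# the transcript (teacher). Only the audio encoder is trainable.
import torch
import torch.nn.functional as F

def distillation_step(audio_encoder, llm, batch, optimizer):
    resp_len = batch["response_ids"].shape[1]

    # Teacher pass: frozen text-only LLM reads the transcript followed by its
    # own previously generated response; no human-annotated labels are needed.
    with torch.no_grad():
        teacher_logits = llm(
            inputs_embeds=llm.embed_tokens(batch["transcript_and_response_ids"])
        ).logits

    # Student pass: the same frozen LLM, but conditioned on trainable audio
    # embeddings in place of the transcript tokens.
    audio_embeds = audio_encoder(batch["audio"])
    student_inputs = torch.cat(
        [audio_embeds, llm.embed_tokens(batch["response_ids"])], dim=1
    )
    student_logits = llm(inputs_embeds=student_inputs).logits

    # Match the student's predictive distributions over the response span
    # to the teacher's.
    s = F.log_softmax(student_logits[:, -resp_len:, :], dim=-1)
    t = F.softmax(teacher_logits[:, -resp_len:, :], dim=-1)
    loss = F.kl_div(s, t, reduction="batchmean")

    loss.backward()  # gradients flow only into the audio encoder
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```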