SSR: Alignment-Aware Modality Connector for Speech Language Models

Weiting Tan, Hirofumi Inaguma, Ning Dong, Paden Tomasello, Xutai Ma

arXiv.org Artificial Intelligence 

Fusing speech into a pre-trained language model (a SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality. Leveraging speech-text alignments, our approach segments and compresses speech features to match the granularity of text embeddings. Additionally, we introduce a two-stage training pipeline, consisting of a distillation phase and a fine-tuning phase, to mitigate catastrophic forgetting.

In this work, we focus on integrating speech into pre-trained language models (SpeechLMs). A straightforward approach is to transcribe speech into text and use these transcriptions as prompts for large language models (Huang et al., 2023); however, such cascaded systems suffer from error propagation and higher latency, and they cannot leverage raw speech information such as emotion, speaker identity, and other paralinguistic cues (Faruqui & Hakkani-Tür, 2021; Lin et al., 2022; Kim et al., 2024). Speech representations can be integrated into pre-trained language models mainly through two approaches. The first uses connector modules that map speech representations into the language model's input space without modifying the model's existing vocabulary. These connector-based techniques typically incorporate a compression module that shortens the speech features, improving efficiency.
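To make the alignment-aware compression concrete, below is a minimal sketch of how segment boundaries from a speech-text aligner could be used to pool frame-level speech features down to one vector per text token. The function name, the choice of mean pooling, and the boundary format are illustrative assumptions, not the paper's exact connector design.

```python
import numpy as np

def compress_by_alignment(speech_feats, boundaries):
    """Mean-pool speech frames within each aligned segment.

    speech_feats: (T, d) array of frame-level speech encoder features.
    boundaries: sorted exclusive end indices of each segment, e.g. from
        a speech-text aligner, with one segment per text token.
    Returns: (len(boundaries), d) array, one compressed vector per token.

    NOTE: illustrative sketch; the actual SSR connector may use a
    learned compression module rather than simple mean pooling.
    """
    segments = []
    start = 0
    for end in boundaries:
        # Average all frames aligned to this text token.
        segments.append(speech_feats[start:end].mean(axis=0))
        start = end
    return np.stack(segments)

# Example: 10 frames of 4-dim features aligned to 3 text tokens.
feats = np.arange(40, dtype=float).reshape(10, 4)
compressed = compress_by_alignment(feats, [3, 7, 10])
print(compressed.shape)  # (3, 4)
```

After this step, the sequence fed to the language model has the same length as the token sequence, which is what makes long-form speech encoding efficient.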
