DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu

Jun-13-2024, arXiv.org Artificial Intelligence

The integration of pre-trained text-based large language models (LLMs) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose using discrete speech units (DSUs), rather than continuous-valued speech encoder outputs, which the speech adapter converts to the LLM token embedding space. We generate DSUs using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen and unseen domains, as well as instruction-following capability in spoken question answering. We also explore various types of DSUs extracted from different layers of the self-supervised speech encoder, as well as from Mel-frequency cepstral coefficients (MFCCs). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
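The pipeline described in the abstract (frame-level features, k-means clustering into discrete units, then a mapping into the LLM embedding space) can be illustrated with a small, dependency-free sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: random vectors stand in for self-supervised encoder outputs, the dimensions and number of units are arbitrary, and a plain lookup table stands in for the trained speech adapter, whose architecture the abstract does not specify.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to one feature frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def kmeans(frames, num_units, iterations=10, seed=0):
    """Plain Lloyd's k-means over feature frames; returns the centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, num_units)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_units)]
        for frame in frames:
            clusters[nearest(frame, centroids)].append(frame)
        for k, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def to_discrete_units(frames, centroids):
    """Replace each continuous frame with the ID of its nearest centroid."""
    return [nearest(frame, centroids) for frame in frames]


def embed_units(units, embedding_table):
    """Hypothetical adapter stand-in: look up one vector per unit ID."""
    return [embedding_table[u] for u in units]


# Assumed toy sizes; real encoders and LLMs use far larger dimensions.
FEATURE_DIM, NUM_UNITS, LLM_DIM, NUM_FRAMES = 4, 3, 8, 50

rng = random.Random(1)
# Random vectors standing in for self-supervised encoder outputs.
frames = [[rng.gauss(0, 1) for _ in range(FEATURE_DIM)]
          for _ in range(NUM_FRAMES)]

centroids = kmeans(frames, NUM_UNITS)
units = to_discrete_units(frames, centroids)

# Randomly initialised table; in a real system these weights are trained.
embedding_table = [[rng.gauss(0, 0.02) for _ in range(LLM_DIM)]
                   for _ in range(NUM_UNITS)]
llm_inputs = embed_units(units, embedding_table)
```

In this sketch, `units` is a sequence of integer IDs (one per frame) and `llm_inputs` is the corresponding sequence of vectors in the assumed LLM embedding dimension. The point of the discretization step is that the LLM-facing module only ever sees a finite vocabulary of unit IDs rather than raw continuous features.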

Country:
- North America > United States
  - Illinois > Cook County > Chicago (0.04)
- Asia > South Korea
  - Gyeonggi-do > Suwon (0.04)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Natural Language > Large Language Model (1.00)
