Joint Audio and Speech Understanding

Gong, Yuan, Liu, Alexander H., Luo, Hongyin, Karlinsky, Leonid, Glass, James

Dec-10-2023, arXiv.org Artificial Intelligence

Humans are surrounded by audio signals that include both speech and non-speech sounds. Recognizing and understanding speech and non-speech audio events, and comprehending the relationship between them, are fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events: almost everything perceivable from audio signals.
