Generalized zero-shot audio-to-intent classification
Veera Raghavendra Elluru, Devang Kulshreshtha, Rohit Paturi, Sravan Bodapati, Srikanth Ronanki
Spoken language understanding systems using audio-only data are gaining popularity, yet their ability to handle unseen intents remains limited. In this study, we propose a generalized zero-shot audio-to-intent classification framework that requires only a few sample text sentences per intent. To achieve this, we first train a supervised audio-to-intent classifier on top of a self-supervised pre-trained model. We then leverage a neural audio synthesizer to create audio embeddings for the sample text utterances and perform generalized zero-shot classification on unseen intents using cosine similarity. We also propose a multimodal training strategy that incorporates lexical information into the audio representation to improve zero-shot performance. Compared to audio-only training, our multimodal training approach improves the accuracy of zero-shot classification on unseen intents by 2.75% and 18.2% on the SLURP and internal goal-oriented dialog datasets, respectively.
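At inference time, the described pipeline reduces to nearest-prototype matching in a shared embedding space: each unseen intent is represented by the embeddings of its synthesized sample utterances, and a test utterance is assigned the intent with the highest cosine similarity. The sketch below illustrates only this matching step, not the authors' implementation; the prototype vectors would in practice come from synthesizing the sample text sentences with a neural TTS model and encoding them with the trained audio encoder, and all names, dimensions, and labels here are illustrative.

```python
import numpy as np

def cosine_similarity(query, prototypes):
    """Cosine similarity between one query vector and a matrix of prototypes."""
    query = query / np.linalg.norm(query)
    prototypes = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return prototypes @ query

def classify_zero_shot(audio_embedding, intent_prototypes, intent_labels):
    """Assign the intent whose prototype is most similar to the utterance embedding."""
    scores = cosine_similarity(audio_embedding, intent_prototypes)
    return intent_labels[int(np.argmax(scores))]

# Hypothetical prototypes for unseen intents: one embedding per intent,
# e.g. the mean over a few synthesized sample utterances.
rng = np.random.default_rng(0)
dim = 768  # a typical self-supervised speech encoder hidden size (assumption)
prototypes = rng.normal(size=(3, dim))
labels = ["set_alarm", "play_music", "check_weather"]

# A noisy test embedding near the "play_music" prototype.
query = prototypes[1] + 0.1 * rng.normal(size=dim)
print(classify_zero_shot(query, prototypes, labels))  # -> "play_music"
```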
arXiv.org Artificial Intelligence
Nov 4, 2023