End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders
Wang, Jixuan, Radfar, Martin, Wei, Kai, Chung, Clement
In spoken language understanding (SLU), it is challenging to extract semantic meanings directly from audio signals due to the lack of textual information. Popular end-to-end (E2E) SLU models utilize sequence-to-sequence automatic speech recognition (ASR) models to extract textual embeddings as input to infer semantics, which, however, require computationally expensive auto-regressive decoding. In this work, we leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification (CTC) to extract textual embeddings and use joint CTC and SLU losses for utterance-level SLU tasks. Experiments show that our model achieves 4% absolute improvement over the state-of-the-art.

The limitation of the above approaches is that they cannot be used for sequence labeling tasks, like slot filling. To address this issue, another stream of work builds unified models, which can be trained end-to-end and used for both intent classification and slot filling. One way to achieve E2E training is to re-frame SLU as a sequence-to-sequence task, where semantic labels are treated as another sequence of output labels besides the transcript [9-12]. Another way is to unify the ASR and NLU models and train them together via differentiable neural interfaces [13-16]. One commonly used neural interface is to feed the token-level hidden representations from ASR as input to the NLU model [13-16].
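The central idea in the abstract, training a shared acoustic encoder with a frame-level CTC loss and an utterance-level SLU loss at the same time, can be illustrated with a short sketch. The PyTorch code below is not the authors' implementation: the bidirectional LSTM standing in for the pretrained encoder, the mean pooling, the dimensions, and the interpolation weight `alpha` are all illustrative assumptions.

```python
# Hedged sketch of a joint CTC + utterance-level SLU loss on top of a
# generic acoustic encoder. Shapes, vocabulary size, and `alpha` are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

class JointCTCSLUModel(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256, vocab_size=32, num_intents=10):
        super().__init__()
        # Stand-in for a self-supervised, pretrained acoustic encoder
        # (a wav2vec 2.0-style model would replace this in practice).
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden_dim, vocab_size)   # per-frame token logits
        self.slu_head = nn.Linear(2 * hidden_dim, num_intents)  # utterance-level logits
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        self.cls_loss = nn.CrossEntropyLoss()

    def forward(self, feats, feat_lens, tokens, token_lens, intents, alpha=0.3):
        enc, _ = self.encoder(feats)                    # (B, T, 2H)
        log_probs = self.ctc_head(enc).log_softmax(-1)  # (B, T, V)
        pooled = enc.mean(dim=1)                        # mean pooling over time
        intent_logits = self.slu_head(pooled)           # (B, num_intents)
        # nn.CTCLoss expects log-probs of shape (T, B, V)
        l_ctc = self.ctc_loss(log_probs.transpose(0, 1), tokens, feat_lens, token_lens)
        l_slu = self.cls_loss(intent_logits, intents)
        return alpha * l_ctc + (1.0 - alpha) * l_slu

# Toy usage with random tensors
model = JointCTCSLUModel()
feats = torch.randn(4, 120, 80)                         # (batch, frames, features)
feat_lens = torch.full((4,), 120, dtype=torch.long)
tokens = torch.randint(1, 32, (4, 20))                  # target token ids (no blanks)
token_lens = torch.full((4,), 20, dtype=torch.long)
intents = torch.randint(0, 10, (4,))
loss = model(feats, feat_lens, tokens, token_lens, intents)
loss.backward()
```

In practice the LSTM would be replaced by a self-supervised pretrained encoder, and `alpha` would be tuned to balance the transcription (CTC) supervision against the utterance-level SLU objective.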
arXiv.org Artificial Intelligence
Jun-2-2023
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.94)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)