End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders

Wang, Jixuan, Radfar, Martin, Wei, Kai, Chung, Clement

arXiv.org Artificial Intelligence 

It is challenging to extract semantic meanings directly from audio signals in spoken language understanding (SLU), due to the lack of textual information. Popular end-to-end (E2E) SLU models utilize sequence-to-sequence automatic speech recognition (ASR) models to extract textual embeddings as input to infer semantics, which, however, require computationally expensive auto-regressive decoding. In this work, we leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification (CTC) to extract textual embeddings and use joint CTC and SLU losses for utterance-level SLU tasks. Experiments show that our model achieves 4% absolute improvement over the state-of-the-art.

The limitation of the above approaches is that they cannot be used for sequence labeling tasks, like slot filling. To address this issue, another stream of work builds unified models, which can be trained end-to-end and used for both intent classification and slot filling. One way to achieve E2E training is to re-frame SLU as a sequence-to-sequence task, where semantic labels are treated as another sequence of output labels besides the transcript [9-12]. Another way is to unify ASR and NLU models and train them together via differentiable neural interfaces [13-16]. One commonly used neural interface is to feed the token-level hidden representations from ASR as input to the NLU model [13-16].

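The joint objective described in the abstract can be illustrated with a small sketch: a pretrained self-supervised acoustic encoder produces frame-level hidden states, a CTC head is trained against the transcript to push those states toward textual content, and a pooled utterance representation feeds an SLU (e.g., intent) classifier, with the two losses combined. The following PyTorch example is a minimal illustration, not the authors' implementation; names such as `SLUModel`, `joint_loss`, the mean-pooling choice, and the `ctc_weight` value are assumptions made for this sketch.

```python
# Minimal sketch (assumed, not the paper's code) of joint CTC + SLU training
# on top of a pretrained acoustic encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SLUModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int,
                 num_tokens: int, num_intents: int):
        super().__init__()
        self.encoder = encoder                                 # pretrained self-supervised encoder (placeholder)
        self.ctc_head = nn.Linear(hidden_dim, num_tokens)      # frame-level token logits for CTC
        self.intent_head = nn.Linear(hidden_dim, num_intents)  # utterance-level SLU logits

    def forward(self, speech):
        hidden = self.encoder(speech)        # (batch, frames, hidden_dim)
        ctc_logits = self.ctc_head(hidden)   # (batch, frames, num_tokens)
        pooled = hidden.mean(dim=1)          # simple mean pooling over frames (illustrative choice)
        intent_logits = self.intent_head(pooled)
        return ctc_logits, intent_logits


def joint_loss(ctc_logits, intent_logits, transcripts, transcript_lens,
               input_lens, intents, ctc_weight: float = 0.3):
    """Joint objective: L = w * L_ctc + (1 - w) * L_slu (weight value assumed)."""
    # CTC expects (frames, batch, tokens) log-probabilities.
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, transcripts, input_lens, transcript_lens,
                     blank=0, zero_infinity=True)
    slu = F.cross_entropy(intent_logits, intents)
    return ctc_weight * ctc + (1.0 - ctc_weight) * slu
```

In this sketch, inference for an utterance-level task only needs the encoder and the intent head, so no auto-regressive decoding is required; the CTC branch serves as an auxiliary signal that encourages the hidden states to carry textual information.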