Generative linguistic representation for spoken language identification
Shen, Peng, Lu, Xuguang, Kawai, Hisashi
arXiv.org Artificial Intelligence
Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance. With the success of recent large models, such as GPT and Whisper, the potential to leverage pre-trained models for extracting linguistic features for LID tasks has become a promising area of research. In this paper, we explore the utilization of the decoder-based network from the Whisper model to extract linguistic features through its generative mechanism for improving the classification accuracy in LID tasks. We devised two strategies - one based on the language embedding method and the other focusing …

Ren et al. proposed a two-step training process, which first trains an acoustic model with a connectionist temporal classification (CTC) objective, then a recurrent neural network classifies the language category using the intermediate features derived from the acoustic model as inputs [10]. Multi-task training methods have also been investigated, which enhance performance and bolster model robustness. These methods utilize a shared underlying feature extraction network and jointly train objective functions for speech/phoneme recognition and language recognition [9, 11, 12]. Consideration has also been given to self-supervised phonotactic representations that use context information [13, 14].
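The multi-task setup described above - a shared acoustic feature extractor trained jointly on a CTC speech/phoneme objective and a language-classification objective - can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the implementation from the cited works: the encoder type, layer sizes, label counts, and data are all assumptions made for the example.

```python
# Minimal multi-task sketch: shared encoder + CTC phoneme head + LID head.
# All dimensions, module choices, and labels are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskLID(nn.Module):
    """Shared acoustic encoder feeding two heads: a per-frame CTC head
    for phoneme recognition and an utterance-level language classifier."""

    def __init__(self, n_feats=80, hidden=128, n_phonemes=40, n_langs=8):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, n_phonemes + 1)  # +1 for CTC blank
        self.lid_head = nn.Linear(hidden, n_langs)

    def forward(self, x):
        h, _ = self.encoder(x)                 # (batch, time, hidden)
        ctc_logits = self.ctc_head(h)          # per-frame phoneme logits
        lid_logits = self.lid_head(h.mean(1))  # pooled utterance-level logits
        return ctc_logits, lid_logits

# Toy batch: 2 utterances, 50 frames of 80-dim acoustic features (dummy data).
model = MultiTaskLID()
x = torch.randn(2, 50, 80)
ctc_logits, lid_logits = model(x)

# Joint objective: CTC loss for phonemes + cross-entropy for the language label.
log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (time, batch, classes)
targets = torch.randint(1, 41, (2, 10))                 # dummy phoneme labels
ctc_loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), 50, dtype=torch.long),
    target_lengths=torch.full((2,), 10, dtype=torch.long),
)
lid_loss = nn.CrossEntropyLoss()(lid_logits, torch.tensor([0, 3]))
loss = ctc_loss + lid_loss  # one backward pass updates the shared encoder
```

Because both losses backpropagate through the same encoder, gradients from the phoneme task regularize the representation used for language classification, which is the robustness benefit the multi-task studies report.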
Dec-18-2023