Transformer-based encoder-encoder architecture for Spoken Term Detection
Švec, Jan, Šmídl, Luboš, Lehečka, Jan
arXiv.org Artificial Intelligence
The paper presents a method for spoken term detection based on the Transformer architecture. We propose the encoder-encoder architecture employing two BERT-like encoders with additional modifications, including convolutional and upsampling layers, attention masking, and shared parameters. The encoders project a recognized hypothesis and a searched term into a shared embedding space, where the score of the putative hit is computed using the calibrated dot product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the proposed system outperformed a baseline method based on deep LSTMs on the English and Czech STD datasets based on USC Shoah Foundation Visual History Archive (MALACH).

In this work, we do not focus on the direct processing of the input speech signal. Instead, we use the speech recognizer to convert an audio signal into a graphemic recognition hypothesis. The representation of speech at the grapheme level allows preprocessing the input audio into a compact confusion network and further to a sequence of embedding vectors. In [7], we proposed a Deep LSTM architecture for spoken term detection, which uses the projection of both the input speech and searched term into a shared embedding space. The hybrid DNN-HMM speech recognizer produced phoneme confusion networks representing the input speech. The DNN-HMM speech recognizer can be replaced with the Wav2Vec
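The scoring scheme described above can be sketched in a few lines: two encoders map the recognized hypothesis and the searched term into a shared embedding space, and the putative-hit score is a calibrated dot product squashed into [0, 1]. The toy grapheme "encoder" below (random character embeddings with mean pooling) merely stands in for the paper's BERT-like encoders, and the calibration constants `A` and `B` are hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Fixed random character embeddings; a stand-in for a learned
# grapheme-level encoder (the paper uses BERT-like encoders).
DIM = 16
rng = np.random.default_rng(0)
CHAR_EMB = {c: rng.standard_normal(DIM) for c in "abcdefghijklmnopqrstuvwxyz "}

def encode(text: str) -> np.ndarray:
    """Toy encoder: mean-pool character embeddings, unit-normalize."""
    vecs = [CHAR_EMB[c] for c in text.lower() if c in CHAR_EMB]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def std_score(hypothesis: str, term: str, A: float = 5.0, B: float = 0.0) -> float:
    """Calibrated dot product: sigmoid(A * <h, t> + B), a score in [0, 1].

    A and B are illustrative calibration parameters (hypothetical values)."""
    dot = float(encode(hypothesis) @ encode(term))
    return 1.0 / (1.0 + np.exp(-(A * dot + B)))

# An exact match scores higher than an unrelated term.
print(std_score("spoken term", "spoken term") >
      std_score("spoken term", "xyzzy"))  # → True
```

In the actual system, both inputs are sequences and the score is computed per frame of the hypothesis; the calibration turns the raw dot product into a usable detection confidence.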
Nov-2-2022