
Collaborating Authors

 Sun, Eric


Target word activity detector: An approach to obtain ASR word boundaries without lexicon

arXiv.org Artificial Intelligence

Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. The issue is further complicated in multilingual models. Existing methods either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimating word boundaries without relying on lexicons. Our method leverages word embeddings derived from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. The proposed method can scale up to any number of languages without incurring additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
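To make the idea concrete, here is a minimal sketch, not the paper's implementation, of a lexicon-free word activity detector: frame-level states from a pretrained ASR encoder are scored against a word embedding pooled from the word's sub-word token embeddings, and the frames where the score is active give the word's boundaries. All module names and dimensions below are illustrative assumptions.

```python
# Sketch of a target word activity detector (assumed architecture).
import torch
import torch.nn as nn

class WordActivityDetector(nn.Module):
    def __init__(self, enc_dim=256, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(enc_dim, emb_dim)  # map frames into embedding space

    def forward(self, frames, subword_embs):
        # frames: (T, enc_dim) encoder states from a pretrained ASR model
        # subword_embs: (S, emb_dim) embeddings of the word's sub-word tokens
        word_emb = subword_embs.mean(dim=0)    # pool sub-words into one word vector
        scores = self.proj(frames) @ word_emb  # (T,) similarity per frame
        return torch.sigmoid(scores)           # frame-wise "word active" probability

detector = WordActivityDetector()
frames = torch.randn(120, 256)   # ~1.2 s of 10 ms frames (random stand-in)
subwords = torch.randn(3, 256)   # e.g. 3 sub-word pieces of one word
activity = detector(frames, subwords)
active = (activity > 0.5).nonzero().squeeze(-1)
if active.numel() > 0:
    print("start frame:", active[0].item(), "end frame:", active[-1].item())
```

Because the detector needs only word alignments (not pronunciations) as supervision, adding a language changes nothing in the model itself, which is the scaling property the abstract claims.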


LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure, so a single transducer model can perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and must not require source language identification (i.e., they should be language-agnostic). In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Building on the transducer model structure, we propose four methods: a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for the encoder, and connectionist temporal classification regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also matches the performance of monolingual ASR and bilingual ST models.
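As a rough illustration of the first method, here is a minimal transducer joint network in which a single prediction network and joint network emit tokens from one shared vocabulary for all languages and tasks. Dimensions and names are assumptions for the sketch, not LAMASSU's actual configuration.

```python
# Minimal unified joint network sketch for a neural transducer.
import torch
import torch.nn as nn

class UnifiedJoint(nn.Module):
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=320, vocab=4000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)  # one multilingual output vocabulary

    def forward(self, enc, pred):
        # enc: (T, enc_dim) acoustic states; pred: (U, pred_dim) label states
        joint = torch.tanh(self.enc_proj(enc)[:, None, :] +
                           self.pred_proj(pred)[None, :, :])
        return self.out(joint)  # (T, U, vocab) logits for the RNN-T loss

joint = UnifiedJoint()
logits = joint(torch.randn(50, 256), torch.randn(10, 256))
print(logits.shape)  # torch.Size([50, 10, 4000])
```

Sharing one prediction and joint network across languages, instead of one per language, is where the reported model-size reduction comes from.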


Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

arXiv.org Artificial Intelligence

We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and an LID loss, enabling transformer experts to learn language-specific information. By combining gated transformer experts with shared transformer layers, we construct multilingual transformer blocks and utilize linear experts to effectively regularize the joint network. The curriculum training scheme leverages LID to guide the gated experts in improving their respective language performance. Experimental results on a bilingual task involving English and Spanish demonstrate significant improvements, with average relative word error rate reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Notably, our method achieves performance comparable to an upper-bound model trained and inferred with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals advantages similar to those observed in the bilingual models, highlighting its ease of extension to multiple languages.
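A sketch of the gating idea follows: a small gate predicts soft language weights (trainable with an LID loss) and mixes per-language expert outputs, so no LID input is needed at inference. The module shapes and the utterance-level pooling are assumptions made for brevity.

```python
# Sketch of gated language experts with a learned LID-style gate.
import torch
import torch.nn as nn

class GatedLanguageExperts(nn.Module):
    def __init__(self, dim=256, n_langs=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_langs)])
        self.gate = nn.Linear(dim, n_langs)  # its softmax can be trained with an LID loss

    def forward(self, x):
        # x: (T, dim) frame features
        g = torch.softmax(self.gate(x.mean(dim=0)), dim=-1)      # soft language weights (n_langs,)
        expert_out = torch.stack([e(x) for e in self.experts])   # (n_langs, T, dim)
        return torch.einsum('l,ltd->td', g, expert_out), g

layer = GatedLanguageExperts()
y, gate = layer(torch.randn(100, 256))
print(y.shape, gate)  # mixed output plus the inferred language weights
```

The curriculum scheme in the abstract would supervise `gate` with oracle LID early in training, then let the experts specialize on their own.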


A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

arXiv.org Artificial Intelligence

In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in the target language. The backbone of SM2 is the Transformer Transducer, which has high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs not seen during training.
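The weak-supervision recipe is simple enough to sketch: pair existing ASR audio with machine-translated transcripts to obtain ST training pairs without human ST labels. The `translate` function below is a hypothetical stand-in for whatever MT service is used; it is not an API from the paper.

```python
# Pipeline sketch: build weakly supervised ST data from an ASR corpus.
def translate(text: str, target_lang: str) -> str:
    # Stand-in for a machine translation service call (hypothetical).
    return f"[{target_lang}] {text}"

def build_weak_st_corpus(asr_corpus, target_lang):
    # asr_corpus: iterable of (audio_path, transcript, source_lang)
    for audio_path, transcript, _src in asr_corpus:
        target_text = translate(transcript, target_lang)
        # {source-speech, target-text} pair: no human ST labels required
        yield audio_path, target_text

for audio, tgt in build_weak_st_corpus([("utt1.wav", "hello world", "en")], "de"):
    print(audio, tgt)
```

Zero-shot expansion to a new target language then amounts to running this conversion once more over the same audio, with no new speech collection.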


Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

arXiv.org Artificial Intelligence

The sparsely-gated Mixture of Experts (MoE) can magnify network capacity with little additional computational complexity. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multiple language data that the MoE networks can reduce the relative word error rates by 16.3% and 4.6% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID, and the label decoder with the MoE.
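For readers unfamiliar with sparse gating, here is a minimal MoE layer with top-1 routing, the standard construction the abstract builds on (the paper's exact router and expert count are not specified here, so these are illustrative choices).

```python
# Minimal sparsely-gated MoE layer with top-1 routing.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

    def forward(self, x):
        # x: (T, dim). Each frame visits only its top-scoring expert, so
        # parameters grow with n_experts while per-frame compute stays flat.
        probs = torch.softmax(self.router(x), dim=-1)
        weight, idx = probs.max(dim=-1)            # (T,), (T,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(100, 256)).shape)  # torch.Size([100, 256])
```

This sparsity is exactly why MoE "magnifies capacity with little computational complexity": only one expert's weights are touched per frame.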


Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

arXiv.org Artificial Intelligence

Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction over Shallow Fusion by subtracting a weighted internal LM score from an interpolation of the E2E model and external LM scores during beam search. However, on different test sets, the optimal LM interpolation weights vary over a wide range and have to be tuned extensively on well-matched validation sets. In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weight tuning during inference. Besides MWER training with Shallow Fusion (MWER-SF), we propose a novel MWER training with ILME (MWER-ILME) in which ILME-based fusion is conducted to generate the N-best hypotheses and their posteriors. An additional gradient is induced when the internal LM is engaged in the MWER-ILME loss computation. During inference, LM weights pre-determined in MWER training enable robust LM integration on test sets from different domains. In experiments with transformer transducers trained on 30K hours of data, MWER-ILME achieves on average 8.8% and 5.8% relative WER reductions over MWER and MWER-SF training, respectively, on 6 different test sets.
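The ILME-based fusion score described in the abstract can be written in a few lines; the weight values below are placeholders, since the whole point of MWER-ILME is that they are fixed during training rather than tuned per test set.

```python
# ILME-based LM fusion score for one hypothesis during beam search.
def ilme_fusion_score(log_p_e2e: float, log_p_ext_lm: float,
                      log_p_int_lm: float,
                      lam_ext: float = 0.6, lam_int: float = 0.3) -> float:
    # Shallow Fusion keeps only the first two terms; ILME additionally
    # subtracts the E2E model's estimated internal LM score, removing its
    # bias toward the source-domain training text.
    return log_p_e2e + lam_ext * log_p_ext_lm - lam_int * log_p_int_lm
```

In MWER-ILME, this same fused score generates the N-best hypotheses and their posteriors inside the training loss, so `lam_ext` and `lam_int` are baked in before inference.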


Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

arXiv.org Artificial Intelligence

The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method. In this method, the internal LM score is subtracted, during inference, from the score obtained by interpolating the E2E score with the external LM score. To improve ILME-based inference, we propose an internal LM training (ILMT) method that minimizes an additional internal LM loss by updating only the E2E model components that affect the internal LM estimation. ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy. After ILMT, the more modular E2E model with matched training and inference criteria enables a more thorough elimination of the source-domain internal LM, and therefore leads to a more effective integration of the target-domain external LM. In experiments with recurrent neural network transducer and attention-based encoder-decoder models trained on 30K hours of data, ILMT with ILME-based inference achieves up to 31.5% and 11.4% relative word error rate reductions over standard E2E training with Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.
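Here is a self-contained sketch of the ILMT loss for a transducer, assuming the common ILME estimate in which the prediction and joint networks are run with the encoder contribution zeroed out, leaving the model's label-only distribution. The tiny modules below are stand-ins, not the paper's code.

```python
# Sketch of an internal LM loss for a transducer-style model.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 100, 64
emb = nn.Embedding(vocab, dim)     # stand-in label embedding
pred_net = nn.GRU(dim, dim)        # stand-in prediction network
joint_out = nn.Linear(dim, vocab)  # stand-in joint network output layer

def internal_lm_loss(labels):
    # labels: (U,) token ids; teacher-forced next-token prediction.
    pred_states, _ = pred_net(emb(labels[:-1])[:, None, :])  # (U-1, 1, dim)
    enc_zero = torch.zeros_like(pred_states)                 # encoder zeroed: "no acoustics"
    logits = joint_out(torch.tanh(enc_zero + pred_states))[:, 0, :]
    return F.cross_entropy(logits, labels[1:])

labels = torch.randint(0, vocab, (12,))
loss = internal_lm_loss(labels)  # in ILMT this is added to the RNN-T loss,
loss.backward()                  # updating only prediction/joint components
print(loss.item())
```

Training this term alongside the ASR loss is what makes the internal LM a well-formed standalone LM, so subtracting it at inference removes the source-domain bias more cleanly.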


Self-Teaching Networks

arXiv.org Machine Learning

We propose self-teaching networks to improve the generalization capacity of deep neural networks. The idea is to generate soft supervision labels using the output layer to train the lower layers of the network. During training, we add an auxiliary loss that drives the lower layers to mimic the behavior of the output layer. The connection between the two network layers through the auxiliary loss helps the gradient flow, working similarly to residual networks. Furthermore, the auxiliary loss also acts as a regularizer, which improves the generalization capacity of the network. We evaluated the self-teaching network with deep recurrent neural networks on speech recognition tasks, training the acoustic model on 30 thousand hours of data and testing it on data collected from 4 scenarios. We show that the self-teaching network achieves consistent improvements and outperforms existing methods such as label smoothing and confidence penalization.
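A compact sketch of the self-teaching objective: the output layer's softmax provides soft labels for a lower layer through an extra classifier head, and the KL term is added to the main loss. The layer choices and the 0.1 weight are illustrative assumptions, not values from the paper.

```python
# Sketch of a self-teaching auxiliary loss on a two-layer toy network.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_classes = 64, 10
lower = nn.Linear(dim, dim)           # stand-in lower layer
upper = nn.Linear(dim, dim)           # stand-in upper layer(s)
head = nn.Linear(dim, n_classes)      # main output layer
aux_head = nn.Linear(dim, n_classes)  # auxiliary head on the lower layer

x = torch.randn(32, dim)
targets = torch.randint(0, n_classes, (32,))

h_low = torch.relu(lower(x))
logits = head(torch.relu(upper(h_low)))
aux_logits = aux_head(h_low)

main_loss = F.cross_entropy(logits, targets)
# Soft supervision: the lower layer mimics the output layer's distribution.
soft = F.softmax(logits.detach(), dim=-1)
aux_loss = F.kl_div(F.log_softmax(aux_logits, dim=-1), soft, reduction='batchmean')
loss = main_loss + 0.1 * aux_loss  # auxiliary term also acts as a regularizer
loss.backward()
```

The auxiliary path gives the lower layer a second, shorter route for gradients, which is the residual-network-like effect the abstract mentions.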


Location3: How Users Share and Respond to Location-Based Data on Social Networking Sites

AAAI Conferences

In August 2010 Facebook launched Places, a location-based service that allows users to check into points of interest and share their physical whereabouts with friends. The friends who see these events in their News Feed can then respond to the check-ins by liking or commenting on them. These data, consisting of the places people go and how their friends react to them, form a rich, novel dataset. In this paper we first analyze this dataset to understand the factors that influence where users check in, including previous check-ins, similarity to other places, where their friends check in, time of day, and demographics. We show how these factors can be used to build a predictive model of where users will check in next. Then we analyze how users respond to their friends' check-ins and which factors contribute to users liking or commenting on them. We show how this can be used to improve the ranking of check-in stories, ensuring that users see only the most relevant updates from their friends and that businesses derive maximum value from check-ins at their establishments. Finally, we construct a model to predict friendship based on check-in count and show that co-check-ins have a statistically significant effect on friendship.
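As a toy illustration of the final model, one could fit a logistic regression from co-check-in counts to friendship labels; the synthetic data and single-feature setup below are assumptions for the sketch, not the paper's dataset or full feature set.

```python
# Toy sketch: predicting friendship from co-check-in counts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
co_checkins = rng.poisson(2, size=500)  # shared check-ins per user pair (synthetic)
# Synthetic labels where more co-check-ins make friendship more likely.
friends = (rng.random(500) < 1 / (1 + np.exp(2 - co_checkins))).astype(int)

model = LogisticRegression().fit(co_checkins.reshape(-1, 1), friends)
print(model.coef_)  # a positive weight mirrors the reported significant effect
```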