
Collaborating Authors

 Sun, Eric


Target word activity detector: An approach to obtain ASR word boundaries without lexicon

arXiv.org Artificial Intelligence

Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. The issue is further complicated in multilingual models. Existing methods either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimating word boundaries without relying on lexicons. Our method leverages word embeddings derived from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. The proposed method can scale up to any number of languages without incurring additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
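To make the idea concrete, here is a minimal sketch, not the paper's implementation, of a lexicon-free word activity detector: frame-level states from a pretrained ASR encoder are scored against a word embedding pooled from the word's sub-word token embeddings, and the frames where the score is active give the word's boundaries. All module names and dimensions below are illustrative assumptions.

```python
# Sketch of a target word activity detector (assumed architecture).
import torch
import torch.nn as nn

class WordActivityDetector(nn.Module):
    def __init__(self, enc_dim=256, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(enc_dim, emb_dim)  # map frames into embedding space

    def forward(self, frames, subword_embs):
        # frames: (T, enc_dim) encoder states from a pretrained ASR model
        # subword_embs: (S, emb_dim) embeddings of the word's sub-word tokens
        word_emb = subword_embs.mean(dim=0)    # pool sub-words into one word vector
        scores = self.proj(frames) @ word_emb  # (T,) similarity per frame
        return torch.sigmoid(scores)           # frame-wise "word active" probability

detector = WordActivityDetector()
frames = torch.randn(120, 256)   # ~1.2 s of 10 ms frames (random stand-in)
subwords = torch.randn(3, 256)   # e.g. 3 sub-word pieces of one word
activity = detector(frames, subwords)
active = (activity > 0.5).nonzero().squeeze(-1)
if active.numel() > 0:
    print("start frame:", active[0].item(), "end frame:", active[-1].item())
```

Because the detector needs only word alignments (not pronunciations) as supervision, adding a language changes nothing in the model itself, which is the scaling property the abstract claims.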


LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure, so a single transducer model can perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and must not require source language identification (i.e., they should be language-agnostic). In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Building on the transducer model structure, we propose four methods: a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for the encoder, and connectionist temporal classification regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also matches the performance of monolingual ASR and bilingual ST models.
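As a rough illustration of the first method, here is a minimal transducer joint network in which a single prediction network and joint network emit tokens from one shared vocabulary for all languages and tasks. Dimensions and names are assumptions for the sketch, not LAMASSU's actual configuration.

```python
# Minimal unified joint network sketch for a neural transducer.
import torch
import torch.nn as nn

class UnifiedJoint(nn.Module):
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=320, vocab=4000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)  # one multilingual output vocabulary

    def forward(self, enc, pred):
        # enc: (T, enc_dim) acoustic states; pred: (U, pred_dim) label states
        joint = torch.tanh(self.enc_proj(enc)[:, None, :] +
                           self.pred_proj(pred)[None, :, :])
        return self.out(joint)  # (T, U, vocab) logits for the RNN-T loss

joint = UnifiedJoint()
logits = joint(torch.randn(50, 256), torch.randn(10, 256))
print(logits.shape)  # torch.Size([50, 10, 4000])
```

Sharing one prediction and joint network across languages, instead of one per language, is where the reported model-size reduction comes from.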


Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

arXiv.org Artificial Intelligence

We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and an LID loss, enabling transformer experts to learn language-specific information. By combining gated transformer experts with shared transformer layers, we construct multilingual transformer blocks and utilize linear experts to effectively regularize the joint network. The curriculum training scheme leverages LID to guide the gated experts in improving their respective language performance. Experimental results on a bilingual task involving English and Spanish demonstrate significant improvements, with average relative word error rate reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Notably, our method achieves performance comparable to an upper-bound model trained and inferred with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals advantages similar to those observed in the bilingual models, highlighting its ease of extension to multiple languages.
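A sketch of the gating idea follows: a small gate predicts soft language weights (trainable with an LID loss) and mixes per-language expert outputs, so no LID input is needed at inference. The module shapes and the utterance-level pooling are assumptions made for brevity.

```python
# Sketch of gated language experts with a learned LID-style gate.
import torch
import torch.nn as nn

class GatedLanguageExperts(nn.Module):
    def __init__(self, dim=256, n_langs=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_langs)])
        self.gate = nn.Linear(dim, n_langs)  # its softmax can be trained with an LID loss

    def forward(self, x):
        # x: (T, dim) frame features
        g = torch.softmax(self.gate(x.mean(dim=0)), dim=-1)      # soft language weights (n_langs,)
        expert_out = torch.stack([e(x) for e in self.experts])   # (n_langs, T, dim)
        return torch.einsum('l,ltd->td', g, expert_out), g

layer = GatedLanguageExperts()
y, gate = layer(torch.randn(100, 256))
print(y.shape, gate)  # mixed output plus the inferred language weights
```

The curriculum scheme in the abstract would supervise `gate` with oracle LID early in training, then let the experts specialize on their own.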


A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

arXiv.org Artificial Intelligence

In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in the target language. The backbone of SM2 is the Transformer Transducer, which has high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs not seen during training.
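The weak-supervision recipe is simple enough to sketch: pair existing ASR audio with machine-translated transcripts to obtain ST training pairs without human ST labels. The `translate` function below is a hypothetical stand-in for whatever MT service is used; it is not an API from the paper.

```python
# Pipeline sketch: build weakly supervised ST data from an ASR corpus.
def translate(text: str, target_lang: str) -> str:
    # Stand-in for a machine translation service call (hypothetical).
    return f"[{target_lang}] {text}"

def build_weak_st_corpus(asr_corpus, target_lang):
    # asr_corpus: iterable of (audio_path, transcript, source_lang)
    for audio_path, transcript, _src in asr_corpus:
        target_text = translate(transcript, target_lang)
        # {source-speech, target-text} pair: no human ST labels required
        yield audio_path, target_text

for audio, tgt in build_weak_st_corpus([("utt1.wav", "hello world", "en")], "de"):
    print(audio, tgt)
```

Zero-shot expansion to a new target language then amounts to running this conversion once more over the same audio, with no new speech collection.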


Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

arXiv.org Artificial Intelligence

The sparsely-gated Mixture of Experts (MoE) can magnify network capacity with little additional computational complexity. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multiple language data that the MoE networks can reduce the relative word error rates by 16.3% and 4.6% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID, and the label decoder with the MoE.
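For readers unfamiliar with sparse gating, here is a minimal MoE layer with top-1 routing, the standard construction the abstract builds on (the paper's exact router and expert count are not specified here, so these are illustrative choices).

```python
# Minimal sparsely-gated MoE layer with top-1 routing.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

    def forward(self, x):
        # x: (T, dim). Each frame visits only its top-scoring expert, so
        # parameters grow with n_experts while per-frame compute stays flat.
        probs = torch.softmax(self.router(x), dim=-1)
        weight, idx = probs.max(dim=-1)            # (T,), (T,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(100, 256)).shape)  # torch.Size([100, 256])
```

This sparsity is exactly why MoE "magnifies capacity with little computational complexity": only one expert's weights are touched per frame.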


Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

arXiv.org Artificial Intelligence

Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction over Shallow Fusion by subtracting a weighted internal LM score from an interpolation of the E2E model and external LM scores during beam search. However, on different test sets, the optimal LM interpolation weights vary over a wide range and have to be tuned extensively on well-matched validation sets. In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weight tuning during inference. Besides MWER training with Shallow Fusion (MWER-SF), we propose a novel MWER training with ILME (MWER-ILME) in which ILME-based fusion is conducted to generate the N-best hypotheses and their posteriors. An additional gradient is induced when the internal LM is engaged in the MWER-ILME loss computation. During inference, LM weights pre-determined in MWER training enable robust LM integration on test sets from different domains. In experiments with transformer transducers trained on 30K hours of data, MWER-ILME achieves on average 8.8% and 5.8% relative WER reductions over MWER and MWER-SF training, respectively, on 6 different test sets.
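The ILME-based fusion score described in the abstract can be written in a few lines; the weight values below are placeholders, since the whole point of MWER-ILME is that they are fixed during training rather than tuned per test set.

```python
# ILME-based LM fusion score for one hypothesis during beam search.
def ilme_fusion_score(log_p_e2e: float, log_p_ext_lm: float,
                      log_p_int_lm: float,
                      lam_ext: float = 0.6, lam_int: float = 0.3) -> float:
    # Shallow Fusion keeps only the first two terms; ILME additionally
    # subtracts the E2E model's estimated internal LM score, removing its
    # bias toward the source-domain training text.
    return log_p_e2e + lam_ext * log_p_ext_lm - lam_int * log_p_int_lm
```

In MWER-ILME, this same fused score generates the N-best hypotheses and their posteriors inside the training loss, so `lam_ext` and `lam_int` are baked in before inference.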


Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

arXiv.org Artificial Intelligence

The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method. In this method, the internal LM score is subtracted, during inference, from the score obtained by interpolating the E2E score with the external LM score. To improve ILME-based inference, we propose an internal LM training (ILMT) method that minimizes an additional internal LM loss by updating only the E2E model components that affect the internal LM estimation. ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy. After ILMT, the more modular E2E model with matched training and inference criteria enables a more thorough elimination of the source-domain internal LM, and therefore leads to a more effective integration of the target-domain external LM. In experiments with recurrent neural network transducer and attention-based encoder-decoder models trained on 30K hours of data, ILMT with ILME-based inference achieves up to 31.5% and 11.4% relative word error rate reductions over standard E2E training with Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.
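Here is a self-contained sketch of the ILMT loss for a transducer, assuming the common ILME estimate in which the prediction and joint networks are run with the encoder contribution zeroed out, leaving the model's label-only distribution. The tiny modules below are stand-ins, not the paper's code.

```python
# Sketch of an internal LM loss for a transducer-style model.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 100, 64
emb = nn.Embedding(vocab, dim)     # stand-in label embedding
pred_net = nn.GRU(dim, dim)        # stand-in prediction network
joint_out = nn.Linear(dim, vocab)  # stand-in joint network output layer

def internal_lm_loss(labels):
    # labels: (U,) token ids; teacher-forced next-token prediction.
    pred_states, _ = pred_net(emb(labels[:-1])[:, None, :])  # (U-1, 1, dim)
    enc_zero = torch.zeros_like(pred_states)                 # encoder zeroed: "no acoustics"
    logits = joint_out(torch.tanh(enc_zero + pred_states))[:, 0, :]
    return F.cross_entropy(logits, labels[1:])

labels = torch.randint(0, vocab, (12,))
loss = internal_lm_loss(labels)  # in ILMT this is added to the RNN-T loss,
loss.backward()                  # updating only prediction/joint components
print(loss.item())
```

Training this term alongside the ASR loss is what makes the internal LM a well-formed standalone LM, so subtracting it at inference removes the source-domain bias more cleanly.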


Self-Teaching Networks

arXiv.org Machine Learning

We propose self-teaching networks to improve the generalization capacity of deep neural networks. The idea is to generate soft supervision labels using the output layer to train the lower layers of the network. During training, we add an auxiliary loss that drives the lower layers to mimic the behavior of the output layer. The connection between the two network layers through the auxiliary loss helps the gradient flow, working similarly to residual networks. Furthermore, the auxiliary loss also acts as a regularizer, which improves the generalization capacity of the network. We evaluated the self-teaching network with deep recurrent neural networks on speech recognition tasks, training the acoustic model on 30 thousand hours of data and testing it on data collected from 4 scenarios. We show that the self-teaching network achieves consistent improvements and outperforms existing methods such as label smoothing and confidence penalization.
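A compact sketch of the self-teaching objective: the output layer's softmax provides soft labels for a lower layer through an extra classifier head, and the KL term is added to the main loss. The layer choices and the 0.1 weight are illustrative assumptions, not values from the paper.

```python
# Sketch of a self-teaching auxiliary loss on a two-layer toy network.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_classes = 64, 10
lower = nn.Linear(dim, dim)           # stand-in lower layer
upper = nn.Linear(dim, dim)           # stand-in upper layer(s)
head = nn.Linear(dim, n_classes)      # main output layer
aux_head = nn.Linear(dim, n_classes)  # auxiliary head on the lower layer

x = torch.randn(32, dim)
targets = torch.randint(0, n_classes, (32,))

h_low = torch.relu(lower(x))
logits = head(torch.relu(upper(h_low)))
aux_logits = aux_head(h_low)

main_loss = F.cross_entropy(logits, targets)
# Soft supervision: the lower layer mimics the output layer's distribution.
soft = F.softmax(logits.detach(), dim=-1)
aux_loss = F.kl_div(F.log_softmax(aux_logits, dim=-1), soft, reduction='batchmean')
loss = main_loss + 0.1 * aux_loss  # auxiliary term also acts as a regularizer
loss.backward()
```

The auxiliary path gives the lower layer a second, shorter route for gradients, which is the residual-network-like effect the abstract mentions.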


Location3: How Users Share and Respond to Location-Based Data on Social Networking Sites

AAAI Conferences

In August 2010 Facebook launched Places, a location-based service that allows users to check into points of interest and share their physical whereabouts with friends. The friends who see these events in their News Feed can then respond to the check-ins by liking or commenting on them. These data, consisting of the places people go and how their friends react to them, form a rich, novel dataset. In this paper we first analyze this dataset to understand the factors that influence where users check in, including previous check-ins, similarity to other places, where their friends check in, time of day, and demographics. We show how these factors can be used to build a predictive model of where users will check in next. Then we analyze how users respond to their friends' check-ins and which factors contribute to users liking or commenting on them. We show how this can be used to improve the ranking of check-in stories, ensuring that users see only the most relevant updates from their friends and that businesses derive maximum value from check-ins at their establishments. Finally, we construct a model to predict friendship based on check-in count and show that co-check-ins have a statistically significant effect on friendship.
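As a toy illustration of the final model, one could fit a logistic regression from co-check-in counts to friendship labels; the synthetic data and single-feature setup below are assumptions for the sketch, not the paper's dataset or full feature set.

```python
# Toy sketch: predicting friendship from co-check-in counts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
co_checkins = rng.poisson(2, size=500)  # shared check-ins per user pair (synthetic)
# Synthetic labels where more co-check-ins make friendship more likely.
friends = (rng.random(500) < 1 / (1 + np.exp(2 - co_checkins))).astype(int)

model = LogisticRegression().fit(co_checkins.reshape(-1, 1), friends)
print(model.coef_)  # a positive weight mirrors the reported significant effect
```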