Collaborating Authors

Wang, Peidong


LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. It is thus possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and should not require source language identification (i.e., they should be language-agnostic). In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Building on the transducer model structure, we propose four methods: a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for the encoder, and connectionist temporal classification (CTC) regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also matches the performance of monolingual ASR and bilingual ST models.
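
The sketch below illustrates, in broad strokes, how a single transducer can serve both ASR and ST by conditioning a shared encoder on a target-language embedding, in the spirit of LAMASSU's unified joint/prediction network and target language identification for the encoder. It is a minimal PyTorch illustration, not the authors' code; the layer sizes, the LSTM choice, and the class name TLIDTransducer are all assumptions.

import torch
import torch.nn as nn

class TLIDTransducer(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=1000, num_langs=4):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, feat_dim)   # target-language ID
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)  # unified prediction net
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab))        # unified joint net

    def forward(self, feats, tokens, lang_id):
        # Add the target-language embedding to every acoustic frame so one
        # encoder can serve both ASR (target == source) and ST (target != source).
        x = feats + self.lang_emb(lang_id)[:, None, :]
        enc, _ = self.encoder(x)                      # (B, T, H)
        pred, _ = self.predictor(self.embed(tokens))  # (B, U, H)
        # Broadcast to the (T, U) lattice consumed by the transducer loss.
        joint_in = torch.cat(
            [enc[:, :, None, :].expand(-1, -1, pred.size(1), -1),
             pred[:, None, :, :].expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joint(joint_in)                   # (B, T, U, V) logits

logits = TLIDTransducer()(torch.randn(2, 50, 80),
                          torch.randint(0, 1000, (2, 10)),
                          torch.tensor([0, 2]))
print(logits.shape)  # torch.Size([2, 50, 10, 1000])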


Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

arXiv.org Artificial Intelligence

Simultaneous speech-to-text translation plays a critical role in real-time cross-lingual communication. Despite advancements in recent years, challenges remain in achieving stability in the translation process, a concern primarily manifested in the flickering of partial results. In this paper, we propose a novel revision-controllable method designed to address this issue. Our method introduces an allowed revision window within the beam search pruning process to screen out candidate translations likely to cause extensive revisions, leading to a substantial reduction in flickering and, crucially, providing the capability to eliminate flickering completely. Experiments demonstrate that the proposed method significantly improves decoding stability without substantially compromising translation quality.
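
To make the revision-window idea concrete, here is a minimal sketch (not the paper's implementation) of the pruning step: a beam candidate survives only if it agrees with the previously displayed partial translation everywhere except the last window tokens, so at most that many tokens can flicker, and window=0 forbids revisions entirely. The function name and the (tokens, score) tuple format are assumptions.

def prune_by_revision_window(candidates, displayed, window):
    """candidates: list of (tokens, score); displayed: tokens shown so far."""
    stable = displayed[:max(0, len(displayed) - window)]  # frozen prefix
    kept = [(toks, s) for toks, s in candidates
            if toks[:len(stable)] == stable]
    # Fall back to the best-scoring candidate if pruning empties the beam.
    return kept or [max(candidates, key=lambda c: c[1])]

displayed = ["we", "will", "meet", "tomorrow"]
candidates = [(["we", "will", "meet", "today", "at"], -1.2),
              (["you", "will", "meet", "tomorrow", "at"], -1.0),  # revises token 0
              (["we", "will", "meet", "tomorrow", "at"], -1.5)]
print(prune_by_revision_window(candidates, displayed, window=1))
# Only hypotheses that keep the frozen prefix "we will meet" survive.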


Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

arXiv.org Artificial Intelligence

In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target tokens while incrementally receiving additional speech content. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation, or even improves output quality, compared to separate ASR and ST models, yielding an average improvement of 1.1 WER. In particular, among prior work, only Weller et al., 2021 [10] proposed a unified-decoder solution for real-time applications; that solution, however, leverages a fully attention-based encoder-decoder (AED) architecture [11], which is theoretically not well suited for the streaming scenario [12], and adopts the re-translation approach [13], which is well known to be affected by the flickering problem [14].
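
The core of token-level serialized output training is merging transcript and translation into one stream that the decoder can emit incrementally. The toy function below sketches one plausible interleaving driven by a word alignment; the <asr>/<st> tags and the alignment format are illustrative assumptions, not the paper's exact scheme (which derives the interleaving from textual alignments).

def interleave(src_words, tgt_words, alignment):
    """alignment: list of (src_idx, tgt_idx) pairs.
    Emit each source word, then any target words aligned to it."""
    tgt_by_src = {}
    for s, t in alignment:
        tgt_by_src.setdefault(s, []).append(t)
    out, emitted = [], set()
    for i, w in enumerate(src_words):
        out.append(("<asr>", w))
        for t in sorted(tgt_by_src.get(i, [])):
            if t not in emitted:
                out.append(("<st>", tgt_words[t]))
                emitted.add(t)
    # Target words left unaligned are appended at the end of the stream.
    out.extend(("<st>", w) for j, w in enumerate(tgt_words) if j not in emitted)
    return out

print(interleave(["ciao", "mondo"], ["hello", "world"], [(0, 0), (1, 1)]))
# [('<asr>', 'ciao'), ('<st>', 'hello'), ('<asr>', 'mondo'), ('<st>', 'world')]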


Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

arXiv.org Artificial Intelligence

We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and an LID loss, enabling transformer experts to learn language-specific information. By combining gated transformer experts with shared transformer layers, we construct multilingual transformer blocks and utilize linear experts to effectively regularize the joint network. The curriculum training scheme leverages LID to guide the gated experts toward improving their respective language performance. Experimental results on a bilingual task involving English and Spanish demonstrate significant improvements, with average relative word error rate reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Notably, our method achieves performance comparable to the upper-bound model trained and inferred with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals advantages similar to those observed in the bilingual models, highlighting its ease of extension to multiple languages.
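
A rough sketch of a gated language-expert block follows: a small gate predicts a language posterior from the utterance, and the per-language expert outputs are mixed by that posterior, so no user-supplied LID is needed at inference; a LID loss on gate_logits (not shown) can supervise the gate during curriculum training. All sizes, the utterance-level pooling, and the class name are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class GatedLanguageExperts(nn.Module):
    def __init__(self, dim=256, num_langs=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(num_langs)])
        self.gate = nn.Linear(dim, num_langs)

    def forward(self, x):                              # x: (B, T, D)
        gate_logits = self.gate(x.mean(dim=1))         # utterance-level LID logits
        w = gate_logits.softmax(dim=-1)                # (B, num_langs)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, L, T, D)
        mixed = (w[:, :, None, None] * outs).sum(dim=1)          # gated mixture
        return mixed, gate_logits                      # logits feed the LID loss

y, lid = GatedLanguageExperts()(torch.randn(2, 30, 256))
print(y.shape, lid.shape)  # torch.Size([2, 30, 256]) torch.Size([2, 2])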


A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

arXiv.org Artificial Intelligence

In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in the target language. The backbone of SM2 is the Transformer Transducer, which has high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs that are not seen during training.
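
The weak-supervision recipe described above amounts to a simple data-flow: pseudo ST labels are produced by machine-translating the transcripts of existing ASR corpora. The stub below only illustrates that flow; translate is a stand-in for whichever MT service is used (its lookup table is mock data), and the corpus triple format is an assumption.

def translate(text, src_lang, tgt_lang):
    # Stand-in for a real machine translation service call.
    mock = {("ciao mondo", "it", "en"): "hello world"}
    return mock.get((text, src_lang, tgt_lang), text)

def build_weak_st_corpus(asr_corpus, tgt_lang):
    """asr_corpus: iterable of (audio_path, transcript, src_lang) triples.
    Yields (audio_path, target_text) pairs for training."""
    for audio_path, transcript, src_lang in asr_corpus:
        if src_lang == tgt_lang:
            yield audio_path, transcript  # same language: plain ASR example
        else:
            yield audio_path, translate(transcript, src_lang, tgt_lang)  # pseudo ST label

corpus = [("utt1.wav", "ciao mondo", "it"), ("utt2.wav", "good morning", "en")]
print(list(build_weak_st_corpus(corpus, tgt_lang="en")))
# [('utt1.wav', 'hello world'), ('utt2.wav', 'good morning')]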