AITopics | Le, Duc

Collaborating Authors

Le, Duc

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

Le, Trang, Lazar, Daniel, Kim, Suyoun, Jiang, Shan, Le, Duc, Sagar, Adithya, Livshits, Aleksandr, Aly, Ahmed, Shrivastava, Akshat

arXiv.org Artificial IntelligenceJun-11-2024

Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation, however these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions of autoregressive deliberation systems. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.

artificial intelligence, decoder, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.07823

Country:

Europe > Croatia (0.14)
Europe > Belgium (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

StemGen: A music generation model that listens

Parker, Julian D., Spijkervet, Janne, Kosta, Katerina, Yesiler, Furkan, Kuznetsov, Boris, Wang, Ju-Chiang, Avent, Matt, Chen, Jitong, Le, Duc

arXiv.org Artificial IntelligenceJan-16-2024

End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture and present a number of novel architectural and sampling improvements. We train the described architecture on both an open-source and a proprietary dataset. We evaluate the produced models using standard quality metrics and a new approach based on music information retrieval descriptors. The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2312.08723

Genre: Research Report (0.64)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Seq2seq for Automatic Paraphasia Detection in Aphasic Speech

Perez, Matthew, Le, Duc, Romana, Amrit, Jones, Elise, Licata, Keli, Provost, Emily Mower

arXiv.org Artificial IntelligenceDec-16-2023

Paraphasias are speech errors that are often characteristic of aphasia and they represent an important signal in assessing disease severity and subtype. Traditionally, clinicians manually identify paraphasias by transcribing and analyzing speech-language samples, which can be a time-consuming and burdensome process. Identifying paraphasias automatically can greatly help clinicians with the transcription process and ultimately facilitate more efficient and consistent aphasia assessment. Previous research has demonstrated the feasibility of automatic paraphasia detection by training an automatic speech recognition (ASR) model to extract transcripts and then training a separate paraphasia detection model on a set of hand-engineered features. In this paper, we propose a novel, sequence-to-sequence (seq2seq) model that is trained end-to-end (E2E) to perform both ASR and paraphasia detection tasks. We show that the proposed model outperforms the previous state-of-the-art approach for both word-level and utterance-level paraphasia detection tasks and provide additional follow-up evaluations to further understand the proposed model behavior.

machine learning, natural language, paraphasia detection, (16 more...)

arXiv.org Artificial Intelligence

2312.10518

Country:

North America > United States > California (0.28)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Learning ASR pathways: A sparse multilingual ASR model

Yang, Mu, Tjandra, Andros, Liu, Chunxi, Zhang, David, Le, Duc, Kalinli, Ozlem

arXiv.org Artificial IntelligenceSep-28-2023

Neural network pruning compresses automatic speech recognition (ASR) models effectively. However, in multilingual ASR, language-agnostic pruning may lead to severe performance drops on some languages because language-agnostic pruning masks may not fit all languages and discard important language-specific parameters. In this work, we present ASR pathways, a sparse multilingual ASR model that activates language-specific sub-networks ("pathways"), such that the parameters for each language are learned explicitly. With the overlapping sub-networks, the shared parameters can also enable knowledge transfer for lower-resource languages via joint multilingual training. We propose a novel algorithm to learn ASR pathways, and evaluate the proposed method on 4 languages with a streaming RNN-T model. Our proposed ASR pathways outperform both dense models and a language-agnostically pruned model, and provide better performance on low-resource languages compared to the monolingual sparse models.

artificial intelligence, language-specific mask, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2209.05735

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.59)

Add feedback

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

Kim, Suyoun, Shrivastava, Akshat, Le, Duc, Lin, Ju, Kalinli, Ozlem, Seltzer, Michael L.

arXiv.org Artificial IntelligenceJul-22-2023

End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently. This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition models (ASR), and outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness when text representation quality is low due to ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. We show accuracy improvements on STOP dataset and share the Figure 1: The overall architecture of End-to-End Spoken Language analysis to demonstrate the effectiveness of our approach.

artificial intelligence, natural language, nlu component, (15 more...)

arXiv.org Artificial Intelligence

2307.12134

Genre:

Research Report > New Finding (0.47)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Add feedback

Text Generation with Speech Synthesis for ASR Data Augmentation

Huang, Zhuangqun, Keren, Gil, Jiang, Ziran, Jain, Shashank, Goss-Grubbs, David, Cheng, Nelson, Abtahi, Farnaz, Le, Duc, Zhang, David, D'Avirro, Antony, Campbell-Taylor, Ethan, Salas, Jessie, Veliche, Irina-Elena, Chen, Xi

arXiv.org Artificial IntelligenceMay-22-2023

Aiming at reducing the reliance on expensive human annotations, data synthesis for Automatic Speech Recognition (ASR) has remained an active area of research. While prior work mainly focuses on synthetic speech generation for ASR data augmentation, its combination with text generation methods is considerably less explored. In this work, we explore text augmentation for ASR using large-scale pre-trained neural networks, and systematically compare those to traditional text augmentation methods. The generated synthetic texts are then converted to synthetic speech using a text-to-speech (TTS) system and added to the ASR training data. In experiments conducted on three datasets, we find that neural models achieve 9%-15% relative WER improvement and outperform traditional methods. We conclude that text augmentation, particularly through modern neural approaches, is a viable tool for improving the accuracy of ASR systems.

machine learning, natural language, utterance, (19 more...)

arXiv.org Artificial Intelligence

2305.16333

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Add feedback

Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

Tjandra, Andros, Singhal, Nayan, Zhang, David, Kalinli, Ozlem, Mohamed, Abdelrahman, Le, Duc, Seltzer, Michael L.

arXiv.org Artificial IntelligenceNov-10-2022

End-to-end multilingual ASR has become more appealing because of several reasons such as simplifying the training and deployment process and positive performance transfer from high-resource to low-resource languages. However, scaling up the number of languages, total hours, and number of unique tokens is not a trivial task. This paper explores large-scale multilingual ASR models on 70 languages. We inspect two architectures: (1) Shared embedding and output and (2) Multiple embedding and output model. In the shared model experiments, we show the importance of tokenization strategy across different languages. Later, we use our optimal tokenization strategy to train multiple embedding and output model to further improve our result. Our multilingual ASR achieves 13.9%-15.6% average WER relative improvement compared to monolingual models. We show that our multilingual ASR generalizes well on an unseen dataset and domain, achieving 9.5% and 7.5% WER on Multilingual Librispeech (MLS) with zero-shot and finetuning, respectively.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2211.05756

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language (0.67)

Add feedback

Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition

Kim, Suyoun, Li, Ke, Kabela, Lucas, Huang, Rongqing, Zhu, Jiedan, Kalinli, Ozlem, Le, Duc

arXiv.org Artificial IntelligenceOct-31-2022

Recently, there has been an increasing interest in two-pass streaming end-to-end speech recognition (ASR) that incorporates a 2nd-pass rescoring model on top of the conventional 1st-pass streaming ASR model to improve recognition accuracy while keeping latency low. One of the latest 2nd-pass rescoring model, Transformer Rescorer, takes the n-best initial outputs and audio embeddings from the 1st-pass model, and then choose the best output by re-scoring the n-best initial outputs. However, training this Transformer Rescorer requires expensive paired audio-text training data because the model uses audio embeddings as input. In this work, we present our Joint Audio/Text training method for Transformer Rescorer, to leverage unpaired text-only data which is relatively cheaper than paired audio-text data. We evaluate Transformer Rescorer with our Joint Audio/Text training on Librispeech dataset as well as our large-scale in-house dataset and show that our training method can improve word error rate (WER) significantly compared to standard Transformer Rescorer without requiring any extra model parameters or latency.

artificial intelligence, machine learning, transformer rescorer, (11 more...)

arXiv.org Artificial Intelligence

2211.00174

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Alignment Restricted Streaming Recurrent Neural Network Transducer

Mahadeokar, Jay, Shangguan, Yuan, Le, Duc, Keren, Gil, Su, Hang, Le, Thong, Yeh, Ching-Feng, Fuegen, Christian, Seltzer, Michael L.

arXiv.org Artificial IntelligenceNov-5-2020

There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for longer spans of input audio, before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token emission delays and the Word Error Rate (WER). The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4 times higher throughput for our LSTM model architecture, enabling faster training and convergence on GPUs.

alignment, deep learning, neural network, (16 more...)

arXiv.org Artificial Intelligence

2011.03072

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR

Le, Duc, Koehler, Thilo, Fuegen, Christian, Seltzer, Michael L.

arXiv.org Machine LearningOct-22-2019

Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel method to train a statistical grapheme-to-grapheme (G2G) model on text-to-speech data that can rewrite an arbitrary character sequence into more phonetically consistent forms. We show that using G2G to provide alternative pronunciations during decoding reduces Word Error Rate by 3% to 11% relative over a strong graphemic baseline and bridges the gap on rare name recognition with an equivalent phonetic setup. Unlike many previously proposed methods, our method does not require any change to the acoustic model training procedure. This work reaffirms the efficacy of grapheme-based modeling and shows that specialized linguistic knowledge, when available, can be leveraged to improve graphemic ASR.

deep learning, neural network, speech recognition, (20 more...)

arXiv.org Machine Learning

1910.12612

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback