Galvez, Daniel
Training and Inference Efficiency of Encoder-Decoder Speech Models
Żelasko, Piotr, Dhawan, Kunal, Galvez, Daniel, Puvvada, Krishna C., Pasad, Ankita, Koluguri, Nithin Rao, Hu, Ke, Lavrukhin, Vitaly, Balam, Jagadeesh, Ginsburg, Boris
The attention encoder-decoder architecture is the backbone of several recent top-performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on efficiency and ask two questions: are we training these speech models efficiently, and what can we do to improve? We argue that a major, if not the most severe, factor hurting training efficiency is the strategy used to sample sequential data into mini-batches. We show that careless mini-batch sampling leads to more than 50% of computation being spent on padding. We study, profile, and optimize Canary-1B training, showing incremental improvements in GPU utilization that culminate in a 5x increase in average batch size over the original training settings. This in turn allows us to train an equivalent model with 4x fewer GPUs in the same wall time, or with the original resources in half the wall time. Finally, we observe that the major inference bottleneck lies in the autoregressive decoder steps. We find that shifting model parameters from the decoder to the encoder results in a 3x inference speedup, as measured by inverse real-time factor (RTFx), while preserving accuracy and the compute requirements for convergence. The training code and models will be available as open source.
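To illustrate the padding problem, here is a minimal sketch (not the paper's code) that estimates the fraction of encoder frames spent on padding when variable-length utterances are batched by a fixed utterance count in random order versus after duration sorting; the duration distribution and batch size are assumptions chosen for illustration only.

import random

def padding_fraction(durations, batch_size):
    # Fraction of frames that are padding when batching in the given order:
    # each batch is padded to the length of its longest utterance.
    padded, real = 0.0, 0.0
    for i in range(0, len(durations), batch_size):
        batch = durations[i:i + batch_size]
        padded += max(batch) * len(batch)
        real += sum(batch)
    return 1.0 - real / padded

random.seed(0)
durations = [random.uniform(1.0, 40.0) for _ in range(10_000)]  # seconds (assumed distribution)

print(f"padding share, random batches: {padding_fraction(durations, 32):.1%}")
print(f"padding share, duration-sorted batches: {padding_fraction(sorted(durations), 32):.1%}")

Sorting (or bucketing) utterances by duration before batching drives the padding share toward zero, which is the kind of sampling-strategy fix the paper's efficiency analysis motivates.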
EMMeTT: Efficient Multimodal Machine Translation Training
Żelasko, Piotr, Chen, Zhehuai, Wang, Mengru, Galvez, Daniel, Hrinchuk, Oleksii, Ding, Shuoyang, Hu, Ke, Balam, Jagadeesh, Lavrukhin, Vitaly, Ginsburg, Boris
Rising interest in extending foundation language models to new modalities warrants a discussion of the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime for Speech-LLMs that adds automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only GPT and encoder-decoder T5, extended with Canary-1B's speech encoder. To handle joint multimodal training, we propose a novel training framework called EMMeTT. EMMeTT improves training efficiency with the following: balanced sampling across languages, datasets, and modalities; efficient sequential data iteration; and a novel 2D bucketing scheme for multimodal data, complemented by a batch size optimizer (OOMptimizer). We show that multimodal training consistently helps with both architectures. Moreover, SALM-T5 trained with EMMeTT retains the original NMT capability while outperforming AST baselines on four-language subsets of FLORES and FLEURS. The resulting Multimodal Translation Model produces strong text and speech translation results at the same time.
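A minimal sketch of the 2D bucketing idea described above: each example is binned jointly by input audio duration and output token count, and each bin carries its own maximum batch size so that every bucket fills, but does not exceed, GPU memory. The bucket edges and batch-size table below are illustrative assumptions, not EMMeTT's actual configuration; in the paper the per-bucket batch sizes are found by the OOMptimizer rather than by a formula.

import bisect

DURATION_EDGES = [5.0, 10.0, 20.0, 40.0]   # audio seconds (hypothetical edges)
TOKEN_EDGES = [32, 64, 128, 256]           # target tokens (hypothetical edges)

def bucket_id(duration_sec, num_tokens):
    # Joint (duration, token-length) bin for one example.
    d = bisect.bisect_left(DURATION_EDGES, duration_sec)
    t = bisect.bisect_left(TOKEN_EDGES, num_tokens)
    return d, t

# Stand-in for an OOMptimizer-style table: larger batches for short audio and
# short text, smaller ones as either dimension grows.
BATCH_SIZE = {(d, t): max(1, 256 // ((d + 1) * (t + 1)))
              for d in range(len(DURATION_EDGES) + 1)
              for t in range(len(TOKEN_EDGES) + 1)}

print(bucket_id(7.3, 50), BATCH_SIZE[bucket_id(7.3, 50)])  # -> (1, 1) 64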
Label-Looping: Highly Efficient Decoding for Transducers
Bataev, Vladimir, Xu, Hainan, Galvez, Daniel, Lavrukhin, Vitaly, Ginsburg, Boris
This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure, built on CUDA tensors, that represents the partial hypotheses of a batch and supports parallelized hypothesis manipulation. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design in which the inner loop consumes all blank predictions, while non-blank predictions are handled in the outer loop. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. Experiments show that the label-looping algorithm brings a speedup of up to 2.0x over conventional batched decoding algorithms at batch size 32, and it can be combined with other compiler- or GPU-call-related techniques for further speedup. We will open-source our implementation to benefit the research community.
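A simplified single-utterance sketch of the nested-loop structure described above; the predictor and joint interfaces are placeholder assumptions, and the batched CUDA-tensor hypothesis structure and per-frame symbol cap from the paper are omitted for brevity.

def label_looping_greedy(encoder_out, predictor, joint, blank_id):
    # encoder_out: [T, D] acoustic frames for one utterance (e.g. a torch tensor).
    hyp = []
    state = predictor.initial_state()
    dec_out, state = predictor.step(blank_id, state)    # start-of-sequence step
    T = encoder_out.shape[0]
    t = 0
    while t < T:                                        # outer loop: one pass per emitted label
        label = int(joint(encoder_out[t], dec_out).argmax())
        while label == blank_id:                        # inner loop: blanks only advance time,
            t += 1                                      # so the prediction network is not re-run
            if t == T:
                return hyp
            label = int(joint(encoder_out[t], dec_out).argmax())
        hyp.append(label)                               # non-blank label extends the hypothesis
        dec_out, state = predictor.step(label, state)   # and advances the prediction network
    return hyp

Because the expensive prediction-network call happens only in the outer loop, a whole batch can advance through blank-dominated frames in lockstep, which is where the parallelism gain comes from.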
Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU
Galvez, Daniel, Bataev, Vladimir, Xu, Hainan, Kaldewey, Tim
The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle ~80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1.1 billion parameter RNN-T model end-to-end by a factor of 2.5x. This technique can also be applied to the alternative "label-looping" greedy decoding algorithm, achieving 1.7x and 1.4x end-to-end speedups when applied to 1.1 billion parameter RNN-T and Token-and-Duration Transducer models, respectively. This work enables a 1.1 billion parameter RNN-T model to run only 16% slower than a similarly sized CTC model, contradicting the common belief that RNN-T models are not suitable for high-throughput inference. The implementation is available in NVIDIA NeMo.
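For context, the PyTorch sketch below (not the NeMo implementation) captures a single decoder step as a CUDA graph, which removes per-kernel launch overhead; the host-side while loop over frames and emitted labels is the data-dependent control flow that CUDA 12.4 graph conditional nodes allow to move inside the graph itself. The decoder_step module is a stand-in for the prediction and joint networks.

import torch

device = torch.device("cuda")
decoder_step = torch.nn.Linear(640, 640).to(device)    # stand-in for predictor + joint

static_in = torch.zeros(16, 640, device=device)

# Warm up on a side stream (required before capture), then capture one step.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    static_out = decoder_step(static_in)
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = decoder_step(static_in)

# Replaying re-runs the captured kernels in place with new input data; without
# conditional nodes, the surrounding decoding loop still runs on the host.
static_in.copy_(torch.randn(16, 640, device=device))
graph.replay()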
GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition
Galvez, Daniel, Kaldewey, Tim
While Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines, their performance has been limited by CPU-based beam search decoding. We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam search decoder compatible with current CTC models. It increases pipeline throughput, decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition. We provide pre-built, DLPack-based Python bindings for ease of use with Python-based machine learning frameworks at https://github.com/nvidia-riva/riva-asrlib-decoder. We evaluated our decoder in offline and online scenarios, demonstrating that it is the fastest beam search decoder for CTC models. In the offline scenario it achieves up to 7 times higher throughput than the current state-of-the-art CPU decoder, and in the online streaming scenario it achieves nearly 8 times lower latency, with the same or better word error rate.
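A minimal sketch of the DLPack handoff that such Python bindings rely on (the decoder-side API itself is not reproduced here; see the repository above for the real interface). A GPU tensor of CTC log-probabilities is exported as a DLPack capsule so another library can consume it without copying; the tensor shape below is an arbitrary example.

import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

# [batch, time, vocab] CTC log-probabilities living on the GPU.
log_probs = torch.randn(4, 100, 1025, device="cuda").log_softmax(dim=-1)

capsule = to_dlpack(log_probs)          # zero-copy export to a DLPack capsule
same_memory = from_dlpack(capsule)      # zero-copy import on the consumer side
assert same_memory.data_ptr() == log_probs.data_ptr()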
Speech Wikimedia: A 77 Language Multilingual Speech Dataset
Gómez, Rafael Mosquera, Eusse, Julián, Ciro, Juan, Galvez, Daniel, Hileman, Ryan, Bollacker, Kurt, Kanter, David
The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.
The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
Galvez, Daniel, Diamos, Greg, Ciro, Juan, Cerón, Juan Felipe, Achorn, Keith, Gopi, Anjali, Kanter, David, Lam, Maximilian, Mazumder, Mark, Reddi, Vijay Janapa
The People's Speech is a free-to-download, 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data was collected by searching the Internet for appropriately licensed audio with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on LibriSpeech's test-clean set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and our plans for continued maintenance of the project under MLCommons's sponsorship.