AITopics | Picheny, Michael

Collaborating Authors

Picheny, Michael

Improving Joint Speech-Text Representations Without Alignment

Peyser, Cal, Meng, Zhong, Hu, Ke, Prabhavalkar, Rohit, Rosenberg, Andrew, Sainath, Tara N., Picheny, Michael, Cho, Kyunghyun

arXiv.org Artificial Intelligence, Aug-11-2023

The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.
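One way to read "simply assume the best alignment" is as a minimum over monotonic alignments between the longer speech sequence and the shorter text sequence. The sketch below is illustrative only and is not taken from the paper: the squared-distance cost, the rule that every text position receives at least one frame, and the names `sq_dist` and `best_alignment_loss` are all assumptions.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two embedding vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_alignment_loss(speech, text):
    """Minimum total squared distance over monotonic alignments.

    Assumed alignment rule (hypothetical): every speech frame is assigned to
    exactly one text position, positions are visited in order, and every text
    position receives at least one frame.
    """
    T, U = len(speech), len(text)
    if U == 0 or T < U:
        raise ValueError("need len(speech) >= len(text) >= 1")
    inf = float("inf")
    # prev[u]: best cost so far with the most recent frame aligned to position u.
    prev = [inf] * U
    prev[0] = sq_dist(speech[0], text[0])
    for t in range(1, T):
        cur = [inf] * U
        for u in range(U):
            stay = prev[u]                          # frame t stays on position u
            advance = prev[u - 1] if u > 0 else inf  # frame t moves on to position u
            best = min(stay, advance)
            if best < inf:
                cur[u] = best + sq_dist(speech[t], text[u])
        prev = cur
    return prev[U - 1]


# Toy 1-D embeddings: three speech frames against two text positions.
loss = best_alignment_loss([[0.0], [0.5], [1.0]], [[0.0], [1.0]])
```

In this toy case the middle frame can be assigned to either text position at the same cost, so the loss does not depend on which alignment is chosen.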

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2308.06125

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)

A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Peyser, Cal, Picheny, Michael, Cho, Kyunghyun, Prabhavalkar, Rohit, Huang, Ronny, Sainath, Tara

arXiv.org Artificial Intelligence, Apr-19-2023

Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and with realistic requirements like a constrained model size and CPU budget, streaming capability, and a rich lattice for rescoring and for downstream NLU tasks. In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting using joint training. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during ...

Unlike previous work, we apply these methods to a state-of-the-art, 160M-parameter streaming Conformer [7] model that is already trained on a very large supervised corpus. We further depart from previous work by training supervised and unsupervised tasks jointly, which is being increasingly shown to be preferable to the conventional fine-tuning approach on very large datasets [8]. We find that under these conditions, none of the studied methods improve general WER at all. However, we report improvements in the decoder's computational load and in lattice density, as well as in several targeted WER measurements assessing performance on known categories of particularly difficult utterances. Through this comparison and analysis, we hope to offer a more nuanced and comprehensive view of the usefulness of unpaired audio and text in industrial ASR.

artificial intelligence, machine learning, representation, (19 more...)

arXiv.org Artificial Intelligence

2304.11053

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Dual Learning for Large Vocabulary On-Device ASR

Peyser, Cal, Huang, Ronny, Sainath, Tara, Prabhavalkar, Rohit, Picheny, Michael, Cho, Kyunghyun

arXiv.org Artificial Intelligence, Jan-11-2023

Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once. In this scheme, each model is used to generate pseudo-labels for unlabeled examples that are used to train the other model. Dual learning has seen some use in speech processing by pairing ASR and TTS as dual tasks. However, these results mostly address only the case of using unpaired examples to compensate for very small supervised datasets, and mostly on large, non-streaming models. Dual learning has not yet been proven effective for using unsupervised data to improve realistic on-device streaming models that are already trained on large supervised corpora. We provide this missing piece through an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
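The improvements quoted in this abstract are relative, not absolute, reductions in word error rate (WER). As a minimal sketch of how the two quantities relate, the snippet below computes WER as word-level edit distance divided by the number of reference words, then the relative improvement between two systems. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words consumed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1], len(ref)


def wer(pairs):
    """Corpus-level WER over (reference, hypothesis) pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words


def relative_improvement(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical figures: a drop from 10.0% to 9.0% WER is one absolute point
# but a 10% relative improvement.
example = relative_improvement(10.0, 9.0)
```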

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2301.04327

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Improving Efficiency in Large-Scale Decentralized Distributed Training

Zhang, Wei, Cui, Xiaodong, Kayi, Abdullah, Liu, Mingrui, Finkler, Ulrich, Kingsbury, Brian, Saon, George, Mroueh, Youssef, Buyuktosunoglu, Alper, Das, Payel, Kung, David, Picheny, Michael

arXiv.org Machine Learning, Feb-3-2020

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Parallel SGD (AD-PSGD) are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One drawback of (A)D-PSGD is that the spectral gap of the mixing matrix decreases when the number of learners in the system increases, which hampers convergence. In this paper, we investigate techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost. We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task. On an IBM P9 supercomputer, our system is able to train an LSTM acoustic model in 2.28 hours with 7.5% WER on the Hub5-2000 Switchboard (SWB) test set and 13.3% WER on the CallHome (CH) test set using 64 V100 GPUs and in 1.98 hours with 7.7% WER on SWB and 13.3% WER on CH using 128 V100 GPUs, the fastest training time reported to date. Index Terms: distributed training, decentralized SGD, parallel computing, automatic speech recognition, image recognition.
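The shrinking spectral gap can be seen on a simple example. The sketch below assumes a ring topology in which each learner averages its weights with its two neighbours using equal weights of 1/3. This topology is an assumption for illustration, since the abstract does not say which mixing matrix the paper uses. Such a mixing matrix is circulant, so its eigenvalues have the closed form (1 + 2 cos(2πk/n)) / 3, and the spectral gap is one minus the largest eigenvalue magnitude other than the leading eigenvalue 1.

```python
import math


def ring_spectral_gap(n):
    """Spectral gap of the ring mixing matrix with weights 1/3 (assumed topology).

    Each learner averages itself with its left and right neighbour. The matrix
    is circulant, so eigenvalue k is (1 + 2*cos(2*pi*k/n)) / 3 for k = 0..n-1.
    """
    if n < 3:
        raise ValueError("the ring needs at least 3 learners")
    # Skip k = 0, which is the leading eigenvalue 1.
    others = [(1.0 + 2.0 * math.cos(2.0 * math.pi * k / n)) / 3.0 for k in range(1, n)]
    return 1.0 - max(abs(e) for e in others)


# The gap shrinks as learners are added, so averaging mixes information more slowly.
gaps = {n: ring_spectral_gap(n) for n in (4, 8, 16, 64)}
```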

deep learning, learner, neural network, (18 more...)

arXiv.org Machine Learning

2002.01119

Genre: Research Report (0.40)

Industry: Information Technology (0.51)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

Zhang, Wei, Cui, Xiaodong, Finkler, Ulrich, Saon, George, Kayi, Abdullah, Buyuktosunoglu, Alper, Kingsbury, Brian, Kung, David, Picheny, Michael

arXiv.org Machine Learning, Jul-10-2019

Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with a much larger batch size than the commonly used Synchronous SGD (SSGD) algorithm. On the commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enabling training at a much larger scale. Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy the ADPSGD algorithm among themselves. On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test-set and a 13.2% WER on the Call-home (CH) test-set in 5.2 hours. To the best of our knowledge, this is the fastest ASR training system attaining this level of model accuracy for the SWB-2000 task ever reported in the literature.

adpsgd, deep learning, speech recognition, (19 more...)

arXiv.org Machine Learning

1907.05701

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Industry: Information Technology (0.35)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Distributed Deep Learning Strategies For Automatic Speech Recognition

Zhang, Wei, Cui, Xiaodong, Finkler, Ulrich, Kingsbury, Brian, Saon, George, Kung, David, Picheny, Michael

arXiv.org Machine Learning, Apr-9-2019

In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), which is one of the most widely used datasets for ASR performance benchmarking. We first investigate the proper hyper-parameters (e.g., learning rate) to enable training with a sufficiently large batch size without impairing the model accuracy. We then implement various distributed strategies, including Synchronous (SYNC), Asynchronous Decentralized Parallel SGD (ADPSGD) and the hybrid of the two (HYBRID), to study their runtime/accuracy trade-off. We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test set. Furthermore, we can train the model using HYBRID in 11.5 hours with 32 NVIDIA V100 GPUs without loss in accuracy.

batch size, deep learning, speech recognition, (19 more...)

arXiv.org Machine Learning

1904.04956

Country:

North America > United States (0.14)
Europe > Italy (0.14)

Genre: Research Report (0.50)

Industry: Information Technology (0.88)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks

Cui, Xiaodong, Zhang, Wei, Tüske, Zoltán, Picheny, Michael

Neural Information Processing Systems, Dec-31-2018

We propose a population-based Evolutionary Stochastic Gradient Descent (ESGD) framework for optimizing deep neural networks. ESGD combines SGD and gradient-free evolutionary algorithms as complementary algorithms in one framework in which the optimization alternates between the SGD step and evolution step to improve the average fitness of the population. With a back-off strategy in the SGD step and an elitist strategy in the evolution step, it guarantees that the best fitness in the population will never degrade. In addition, individuals in the population optimized with various SGD-based optimizers using distinct hyper-parameters in the SGD step are considered as competing species in a coevolution setting such that the complementarity of the optimizers is also taken into account. The effectiveness of ESGD is demonstrated across multiple applications including speech recognition, image recognition and language modeling, using networks with a variety of deep architectures.
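The alternation described in this abstract can be sketched on a toy problem. The code below is a minimal illustration and not the paper's algorithm: the quadratic objective, the Gaussian mutation, the use of learning rate as the only per-individual hyper-parameter, and all function names are assumptions. It keeps the two properties the abstract highlights: a back-off in the SGD step (an individual reverts if its loss got worse) and an elitist evolution step (the fittest of parents plus offspring survive), so the best loss in the population never increases.

```python
import random


def loss(x):
    # Toy objective (assumed): a quadratic bowl with its minimum at the origin.
    return sum(v * v for v in x)


def noisy_grad(x, rng, noise=0.1):
    # Gradient of the quadratic plus Gaussian noise, standing in for a minibatch gradient.
    return [2.0 * v + rng.gauss(0.0, noise) for v in x]


def sgd_step(x, lr, rng, steps=5):
    # Run a few SGD updates, then back off to the starting point if the loss got worse.
    y = list(x)
    for _ in range(steps):
        g = noisy_grad(y, rng)
        y = [v - lr * gi for v, gi in zip(y, g)]
    return y if loss(y) <= loss(x) else list(x)


def evolution_step(population, rng, sigma=0.05):
    # Mutate each parent once, then keep the fittest of parents + offspring (elitist).
    offspring = [([v + rng.gauss(0.0, sigma) for v in x], lr) for x, lr in population]
    pool = sorted(population + offspring, key=lambda ind: loss(ind[0]))
    return pool[:len(population)]


def esgd(generations=10, pop_size=6, dim=4, seed=0):
    rng = random.Random(seed)
    learning_rates = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4]  # one "species" per rate
    population = [
        ([rng.uniform(-1.0, 1.0) for _ in range(dim)], learning_rates[i % len(learning_rates)])
        for i in range(pop_size)
    ]
    best_history = [min(loss(x) for x, _ in population)]
    for _ in range(generations):
        population = [(sgd_step(x, lr, rng), lr) for x, lr in population]
        best_history.append(min(loss(x) for x, _ in population))
        population = evolution_step(population, rng)
        best_history.append(min(loss(x) for x, _ in population))
    return best_history


history = esgd()
```

The returned list records the best loss after every SGD step and every evolution step, which makes the non-degradation property easy to check.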

artificial intelligence, deep learning, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Genre: Research Report (0.68)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks

Cui, Xiaodong, Zhang, Wei, Tüske, Zoltán, Picheny, Michael

arXiv.org Machine Learning, Oct-15-2018

deep learning, esgd, neural network, (17 more...)

arXiv.org Machine Learning

1810.06773

Country:

North America > United States (0.14)
North America > Canada (0.14)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Building competitive direct acoustics-to-word models for English conversational speech recognition

Audhkhasi, Kartik, Kingsbury, Brian, Ramabhadran, Bhuvana, Saon, George, Picheny, Michael

arXiv.org Machine Learning, Dec-8-2017

Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is at-par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder or language model. We find that model initialization, training data order, and regularization have the most impact on the A2W model performance. Next, we present a joint word-character A2W model that learns to first spell the word and then recognize it. This model provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely-seen during training.

deep learning, neural network, speech recognition, (19 more...)

arXiv.org Machine Learning

1712.03133

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.88)
