Pratap, Vineel
Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking
Yan, Brian, Pratap, Vineel, Watanabe, Shinji, Auli, Michael
Multilingual Automatic Speech Recognition (ASR) models are typically evaluated in a setting where the ground-truth language of the speech utterance is known; however, this is often not the case in practical settings. Automatic Spoken Language Identification (SLID) models are not perfect, and misclassifications have a substantial impact on final ASR accuracy. In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. Our results on FLEURS using the MMS and Whisper models show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates which are 3.3% and 2.0% lower on these benchmarks.
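A minimal sketch of the kind of N-best re-ranking the abstract describes: each hypothesis carries an acoustic-model score plus external features (a language-model score and a text-based LID score), and a weighted combination selects the best candidate. The dictionary keys and weights below are hypothetical illustrations, not the paper's actual feature set or tuned values.

```python
def rerank_nbest(hypotheses, lm_weight=0.5, lid_weight=1.0):
    """Pick the best hypothesis from an N-best list by combined score.

    Each hypothesis is a dict with (hypothetical) keys:
      am_score  - acoustic-model log-probability
      lm_score  - external language-model log-probability
      lid_score - text-based LID log-probability of the hypothesized language
    The weights would be tuned on held-out data in practice.
    """
    def combined(h):
        return (h["am_score"]
                + lm_weight * h["lm_score"]
                + lid_weight * h["lid_score"])
    return max(hypotheses, key=combined)
```

Because the LID score is computed from the hypothesis text, a fluent transcript in the correct language can overturn an acoustically likely but wrong-language candidate.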
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Huang, Ruizhe, Zhang, Xiaohui, Ni, Zhaoheng, Sun, Li, Hira, Moto, Hwang, Jeff, Manohar, Vimal, Pratap, Vineel, Wiesner, Matthew, Watanabe, Shinji, Povey, Daniel, Khudanpur, Sanjeev
Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., the phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offsets of tokens in addition to their onsets. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, the Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
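The core of a label-prior adjustment can be sketched as dividing frame posteriors by scaled label priors in log space and renormalizing, which down-weights over-predicted labels (notably the blank) and flattens peaky distributions. The paper applies priors inside the training objective; function and argument names here are illustrative, not the released API.

```python
import numpy as np

def apply_label_priors(log_probs, log_priors, prior_scale=1.0):
    """Adjust CTC log-posteriors by scaled label priors.

    log_probs:  (T, V) frame-level log-posteriors
    log_priors: (V,)   marginal label priors, e.g. estimated on training data
    """
    # Subtract scaled log priors: p'(k|t) proportional to p(k|t) / p(k)^scale.
    adjusted = log_probs - prior_scale * log_priors
    # Renormalize each frame so rows remain valid log-distributions.
    adjusted -= np.logaddexp.reduce(adjusted, axis=1, keepdims=True)
    return adjusted
```

With the prior removed, a label that the model predicts often (such as blank) must earn its probability mass frame by frame, which spreads token posteriors over more frames and makes offsets recoverable.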
Scaling Speech Technology to 1,000+ Languages
Pratap, Vineel, Tjandra, Andros, Shi, Bowen, Tomasello, Paden, Babu, Arun, Kundu, Sayani, Elkahky, Ali, Ni, Zhaoheng, Vyas, Apoorv, Fazel-Zarandi, Maryam, Baevski, Alexei, Adi, Yossi, Zhang, Xiaohui, Hsu, Wei-Ning, Conneau, Alexis, Auli, Michael
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Star Temporal Classification: Sequence Classification with Partially Labeled Data
Pratap, Vineel, Hannun, Awni, Synnaeve, Gabriel, Collobert, Ronan
We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC), which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as a composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of a supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.
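The language accepted by a star-augmented label graph can be illustrated with a plain subsequence test: any complete token sequence in which the observed partial labels appear in order is a valid alignment target, with the star arcs absorbing the missing tokens. This helper only illustrates that acceptance condition; the actual STC graph also handles CTC blanks and penalizes star insertions.

```python
def stc_accepts(full_seq, partial_labels):
    """Return True if full_seq could have produced partial_labels.

    A <star> token between (and around) the observed labels matches any
    run of missing tokens, so acceptance reduces to checking that the
    partial labels occur in order within the full sequence.
    """
    it = iter(full_seq)
    # `tok in it` consumes the iterator, so order is enforced.
    return all(tok in it for tok in partial_labels)
```

During training, the forward score then sums over all full sequences the graph accepts, rather than committing to a single (possibly incomplete) target.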
Flashlight: Enabling Innovation in Tools for Machine Learning
Kahn, Jacob, Pratap, Vineel, Likhomanenko, Tatiana, Xu, Qiantong, Hannun, Awni, Cai, Jeff, Tomasello, Paden, Lee, Ann, Grave, Edouard, Avidov, Gilad, Steiner, Benoit, Liptchinsky, Vitaliy, Synnaeve, Gabriel, Collobert, Ronan
As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together.
Differentiable Weighted Finite-State Transducers
Hannun, Awni, Pratap, Vineel, Kahn, Jacob, Hsu, Wei-Ning
The primary difference between ASG and CTC is the inclusion of a blank token, b, represented by the graph in Figure 3a. Constructing CTC amounts to including the blank token graph when constructing the full token graph T. The intersection T ∩ Y then results in the CTC alignment graph (Figure 1b). Note that this version of CTC does not force transitions on b between repeated tokens; doing so requires remembering the previous state and hence is more involved (see Appendix A.1 for details). A benefit of constructing sequence-level criteria by composing operations on simpler graphs is access to a large design space of loss functions with which we can encode useful priors. For example, we could construct a "spike" CTC, a "duration-limited" CTC, or an "equally spaced" CTC by substituting the appropriate token graphs into equation 2 (see Appendix A.2 for details).
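As a rough illustration of the simple CTC topology described here (the version without a forced blank between repeated tokens), one can interleave optional blank states with label states and add self-loop, advance, and blank-skipping arcs. This is a plain-Python sketch of the graph's node and arc lists, not the paper's WFST construction in GTN.

```python
def ctc_graph(labels, blank="<b>"):
    """Build the simple CTC alignment graph for a label sequence.

    Returns (nodes, arcs): nodes is the interleaved blank/label state
    list; each arc (src, dst, token) emits `token`. Blanks are optional
    everywhere, so a skip arc jumps over each blank state, even between
    repeated labels (the simplification noted in the text).
    """
    nodes = []
    for l in labels:
        nodes.append(blank)
        nodes.append(l)
    nodes.append(blank)

    arcs = []
    for i, tok in enumerate(nodes):
        arcs.append((i, i, tok))                   # stay: repeat the token
        if i + 1 < len(nodes):
            arcs.append((i, i + 1, nodes[i + 1]))  # advance to next state
        if i + 2 < len(nodes) and nodes[i + 2] != blank:
            arcs.append((i, i + 2, nodes[i + 2]))  # skip the optional blank
    return nodes, arcs
```

Swapping in a different token graph (e.g., one with bounded self-loop counts) is what yields variants like the "duration-limited" CTC mentioned above.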