Collaborating Authors

 Khare, Yash


Universal-2-TF: Robust All-Neural Text Formatting for ASR

arXiv.org Artificial Intelligence

This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.
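To make the first stage concrete, here is a minimal illustrative sketch of how per-token punctuation and casing tags from a multi-objective token classifier could be rendered into formatted text. The tag inventory and function name are hypothetical stand-ins, not Universal-2-TF's actual label set or code.

```python
# Illustrative sketch (hypothetical tag scheme, not Universal-2-TF's code):
# a token classifier predicts, per raw ASR token, a punctuation tag to append
# and a casing tag; rendering the tags is then a deterministic pass.

def apply_tf_tags(tokens, punct_tags, case_tags):
    """Render formatted text from raw ASR tokens plus predicted tags.

    punct_tags: punctuation to append after each token ('' for none).
    case_tags: one of 'lower', 'cap' (capitalize first letter), 'upper'.
    """
    out = []
    for tok, punct, case in zip(tokens, punct_tags, case_tags):
        if case == "cap":
            tok = tok.capitalize()
        elif case == "upper":
            tok = tok.upper()
        out.append(tok + punct)
    return " ".join(out)

tokens = ["hello", "world", "i", "live", "in", "nyc"]
punct = [",", ".", "", "", "", "."]
case = ["cap", "lower", "upper", "lower", "lower", "upper"]
print(apply_tf_tags(tokens, punct, case))
# prints "Hello, world. I live in NYC."
```

Spans the classifier flags as normalization targets (numbers, dates, currency) would then be rewritten by the second-stage seq2seq model, keeping the expensive generative model off the easy majority of tokens.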


Anatomy of Industrial Scale Multilingual ASR

arXiv.org Artificial Intelligence

This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data and a 90% reduction in hallucination rate on ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
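Word error rate, the headline metric in this evaluation, is the word-level Levenshtein (edit) distance between reference and hypothesis transcripts, normalized by the reference length. A minimal sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the bat sat"))  # prints 0.3333333333333333
```

Production evaluations typically normalize text first (casing, punctuation, number formats) so that formatting differences are not counted as recognition errors.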


Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

arXiv.org Artificial Intelligence

This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training [1] after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable improvements in relative Word Error Rate (WER) by 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data.
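The pseudo-labeling-with-filtering recipe the abstract describes can be sketched as a simple loop: a trained teacher transcribes unlabeled audio, and only utterances the teacher is confident about are kept for student training. Everything below is a hypothetical stand-in for illustration (the stub teacher, the confidence score, and the threshold), not Conformer-1's actual pipeline.

```python
# Illustrative sketch of pseudo-label generation with confidence filtering.
# `teacher_transcribe` stands in for a trained Conformer RNN-T baseline; here
# it is a stub returning (transcript, confidence) so the control flow runs.

def teacher_transcribe(audio_id: str):
    # A real system would run beam-search decoding and derive a per-utterance
    # confidence (e.g. mean token log-probability of the best hypothesis).
    fake = {
        "utt1": ("hello world", 0.95),
        "utt2": ("[unintelligible]", 0.40),
    }
    return fake[audio_id]

def build_pseudo_labeled_set(audio_ids, min_confidence=0.8):
    """Keep only (audio, transcript) pairs the teacher is confident about."""
    kept = []
    for aid in audio_ids:
        text, conf = teacher_transcribe(aid)
        if conf >= min_confidence:
            kept.append((aid, text))
    return kept

print(build_pseudo_labeled_set(["utt1", "utt2"]))
# prints [('utt1', 'hello world')]
```

The filtered pairs are then mixed with human-labeled data to train the student, which in Noisy Student Training is additionally exposed to input perturbations (e.g. SpecAugment) so it generalizes beyond the teacher.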