Chkhetiani, Luka
Anatomy of Industrial Scale Multilingual ASR
Ramirez, Francis McCann, Chkhetiani, Luka, Ehrenberg, Andrew, McHardy, Robert, Botros, Rami, Khare, Yash, Vanzo, Andrea, Peyash, Taufiquzzaman, Oexle, Gabriel, Liang, Michael, Sklyar, Ilya, Fakhan, Enver, Etefy, Ahmed, McCrystal, Daniel, Flamini, Sam, Donato, Domenic, Yoshioka, Takuya
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in hallucinations on ambient noise compared to Whisper, along with significantly improved timestamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
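The evaluations above are reported as word error rate (WER), the word-level edit distance between a reference transcript and a hypothesis, normalized by reference length. As a point of reference, here is a minimal, self-contained sketch of the standard WER computation (the paper does not specify its scoring implementation; real pipelines also apply text normalization before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words. Counts substitutions, insertions, deletions."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. about 0.333.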
Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping
Zhang, Kevin, Chkhetiani, Luka, Ramirez, Francis McCann, Khare, Yash, Vanzo, Andrea, Liang, Michael, Martin, Sergio Ramirez, Oexle, Gabriel, Bousbib, Ruben, Peyash, Taufiquzzaman, Nguyen, Michael, Pulliam, Dillon, Donato, Domenic
This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training [1] after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable improvements in relative Word Error Rate (WER) by 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data.
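The pseudo-labeling step described above can be sketched as follows. This is an illustrative outline only, not the paper's implementation: `teacher_transcribe` is a hypothetical stand-in for the Conformer RNN-T baseline, and the confidence threshold is an assumed filtering criterion (the paper combines data filtering with pseudo-labeling but does not prescribe this exact rule):

```python
def teacher_transcribe(audio: str) -> tuple[str, float]:
    # Hypothetical stand-in for the baseline teacher model: a real system
    # would decode the audio and return a transcript plus a confidence score.
    return f"transcript of {audio}", 0.9 if "clean" in audio else 0.4

def pseudo_label(unlabeled_audio: list[str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Transcribe unlabeled audio with the teacher and keep only
    confident outputs (the filtering step)."""
    kept = []
    for audio in unlabeled_audio:
        text, confidence = teacher_transcribe(audio)
        if confidence >= threshold:
            kept.append((audio, text))
    return kept

def noisy_student_training_data(labeled: list[tuple[str, str]],
                                unlabeled_audio: list[str]) -> list[tuple[str, str]]:
    """Assemble one Noisy Student round's training set: human-labeled data
    plus filtered pseudo-labels. (Student training itself is elided.)"""
    return labeled + pseudo_label(unlabeled_audio)
```

The student is then trained on the combined set, and in iterated variants of Noisy Student Training it can serve as the next round's teacher.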