AITopics | commonvoice

Collaborating Authors

commonvoice

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Towards Early Prediction of Self-Supervised Speech Model Performance

Whetten, Ryan, Maison, Lucas, Parcollet, Titouan, Dinarelli, Marco, Estève, Yannick

arXiv.org Artificial IntelligenceJan-10-2025

In Self-Supervised Learning (SSL), pre-training and evaluation are resource intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream performance. Consequently, it is often difficult to gauge the final downstream performance in a cost efficient manner during pre-training. In this work, we propose unsupervised efficient methods that give insights into the quality of the pre-training of SSL speech models, namely, measuring the cluster quality and rank of the embeddings of the SSL model. Results show that measures of cluster quality and rank correlate better with downstream performance than the pre-training loss with only one hour of unlabeled audio, reducing the need for GPU hours and labeled data in SSL model evaluation.

artificial intelligence, correlation, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.05966

Country:

Europe > France (0.14)
North America > United States (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

We Augmented Whisper With kNN and You Won't Believe What Came Next

Nachesa, Maya K., Niculae, Vlad

arXiv.org Artificial IntelligenceOct-24-2024

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that can instead adapt by building an external datastore that can then be searched during inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.

datastore, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.1885

Country:

Europe > Belgium (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Li, Song, You, Yongbin, Wang, Xuezhi, Tian, Zhengkun, Ding, Ke, Wan, Guanglu

arXiv.org Artificial IntelligenceJun-26-2024

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace, and we believe that such a large corpus will pave new avenues for research in multilingual ASR.

arxiv preprint arxiv, commonvoice, corpus, (13 more...)

arXiv.org Artificial Intelligence

2406.18301

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Universal Automatic Phonetic Transcription into the International Phonetic Alphabet

Taguchi, Chihiro, Sakai, Yusuke, Haghani, Parisa, Chiang, David

arXiv.org Artificial IntelligenceAug-7-2023

This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use training data from seven languages from CommonVoice 11.0, transcribed into IPA semi-automatically. Although this training dataset is much smaller than Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or better results. Furthermore, we show that the quality of our universal speech-to-IPA models is close to that of human annotators.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2308.03917

Country:

Europe > Germany (0.14)
Asia > Myanmar > Chin State > Hakha (0.05)
North America > United States > Indiana > St. Joseph County > Notre Dame (0.04)
(5 more...)

Genre: Research Report > New Finding (0.47)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.48)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.47)

Add feedback

Sumformer: A Linear-Complexity Alternative to Self-Attention for Speech Recognition

Parcollet, Titouan, van Dalen, Rogier, Zhang, Shucong, Bhattacharya, Sourav

arXiv.org Artificial IntelligenceJul-12-2023

Modern speech recognition systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference as well as training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but fail to consistently reach the same level of accuracy. In practice, however, the self-attention weights of trained speech recognizers take the form of a global average over time. This paper, therefore, proposes a linear-time alternative to self-attention for speech recognition. It summarises a whole utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method ``Summary Mixing''. Introducing Summary Mixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while lowering the training and inference times by up to 27% and reducing the memory budget by a factor of two.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2307.07421

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Filters

Collaborating Authors

commonvoice

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

83d349b6eb8125588b5f091e2d47525c-Supplemental-Conference.pdf

Towards Early Prediction of Self-Supervised Speech Model Performance

We Augmented Whisper With kNN and You Won't Believe What Came Next

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Universal Automatic Phonetic Transcription into the International Phonetic Alphabet

Sumformer: A Linear-Complexity Alternative to Self-Attention for Speech Recognition