Speech recognition model


Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data

Kashiwagi, Yosuke, Futami, Hayato, Tsunoo, Emiru, Asakawa, Satoshi

arXiv.org Artificial Intelligence

Whale's architecture integrates the w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data, drawn not only from public corpora but also from in-house data, thereby enhancing the model's robustness to different speaking styles and acoustic conditions. In evaluations on multiple benchmarks, Whale achieves performance comparable to existing models. In particular, it attains a word error rate of 2.4% on the LibriSpeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.
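The joint CTC-attention decoding mentioned above is a common hybrid strategy in which beam-search candidates are scored by interpolating the attention decoder's log-probabilities with CTC scores. Below is a minimal sketch of the scoring step, assuming an ESPnet-style setup; the abstract does not give Whale's decoder internals, and the interpolation weight is illustrative:

```python
import torch

def joint_score(att_logp: torch.Tensor,
                ctc_logp: torch.Tensor,
                ctc_weight: float = 0.3) -> torch.Tensor:
    """Interpolate attention and CTC log-probabilities per candidate token."""
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp

# Toy usage: scores over a 5-token vocabulary at one beam-search step.
att_logp = torch.log_softmax(torch.randn(5), dim=-1)
ctc_logp = torch.log_softmax(torch.randn(5), dim=-1)
print(joint_score(att_logp, ctc_logp).topk(2))
```

A CTC weight around 0.3 is a typical choice: the CTC branch penalizes hypotheses whose alignments it considers implausible, while the attention decoder supplies most of the linguistic context.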


Apple iPhone's voice-to-text feature periodically shows 'Trump' when user says 'racist'

FOX News

Apple's iPhone voice-to-text feature is sparking controversy after a viral TikTok video showed a user speaking the word "racist," which at first showed up as "Trump" before switching back to "racist." Fox News Digital was able to replicate the issue multiple times: the dictation feature briefly flashed "Trump" when a user said "racist" before quickly changing back, just as in the viral video. However, "Trump" did not appear every time a user said "racist"; the feature also wrote words like "reinhold" and "you."


Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities

Adila, Aulia, Lestari, Dessi, Purwarianti, Ayu, Tanaya, Dipta, Azizah, Kurniawati, Sakti, Sakriani

arXiv.org Artificial Intelligence

An ideal speech recognition model can transcribe speech accurately under various characteristics of the speech signal, such as speaking style (read or spontaneous), speech context (formal or informal), and background noise conditions (clean or moderate). Building such a model requires a significant amount of training data with diverse speech characteristics. Currently, Indonesian data is dominated by read, formal, and clean speech, leading to a scarcity of Indonesian data with other speech variabilities. To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, and compile a dataset of Indonesian speech with diverse variabilities to facilitate our study. We further investigate the models' ability to transcribe Indonesian speech across the different variability groups. The fine-tuned Whisper model achieved the best results across datasets with various characteristics, as indicated by the decrease in word error rate (WER) and character error rate (CER). Moreover, we found that speaking-style variability affected model performance the most.
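The WER and CER metrics cited above can be computed per variability group to reproduce this kind of analysis. A minimal sketch using the jiwer library; the group names and sentences are invented placeholders, not the paper's data:

```python
from jiwer import wer, cer  # pip install jiwer

# Hypothetical per-group evaluation: (references, model outputs) grouped by
# speech variability. Group labels and sentences are illustrative only.
groups = {
    "read/formal/clean":    (["selamat pagi semuanya"], ["selamat pagi semuanya"]),
    "spontaneous/informal": (["eh gimana kabarnya"],    ["eh gimana kabar nya"]),
}

for name, (refs, hyps) in groups.items():
    print(f"{name}: WER={wer(refs, hyps):.2%}  CER={cer(refs, hyps):.2%}")
```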


Robust Audiovisual Speech Recognition Models with Mixture-of-Experts

Wu, Yihan, Peng, Yifan, Lu, Yichen, Chang, Xuankai, Song, Ruihua, Watanabe, Shinji

arXiv.org Artificial Intelligence

Visual signals can enhance audiovisual speech recognition accuracy by providing additional contextual information. Given the complexity of visual signals, an audiovisual speech recognition model requires robust generalization across diverse video scenarios, which presents a significant challenge. In this paper, we introduce EVA, which leverages a mixture-of-Experts for audioVisual ASR to perform robust speech recognition on ``in-the-wild'' videos. Specifically, we first encode visual information into a visual token sequence and map it into the speech space with a lightweight projection. We then build EVA upon a robust pretrained speech recognition model, ensuring its generalization ability. Moreover, to incorporate visual information effectively, we inject it into the ASR model through a mixture-of-experts module. Experiments show that our model achieves state-of-the-art results on three benchmarks, demonstrating EVA's generalization ability across diverse video domains.
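As a rough illustration of the two components the abstract describes (a lightweight projection into the speech space and a mixture-of-experts fusion module), here is a minimal PyTorch sketch. The dimensions, softmax router, and residual fusion are assumptions; EVA's actual design may differ:

```python
import torch
import torch.nn as nn

class VisualMoEFusion(nn.Module):
    def __init__(self, d_speech=512, d_visual=768, n_experts=4):
        super().__init__()
        self.project = nn.Linear(d_visual, d_speech)   # visual -> speech space
        self.router = nn.Linear(d_speech, n_experts)   # token-wise gating
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_speech, 2 * d_speech), nn.GELU(),
                          nn.Linear(2 * d_speech, d_speech))
            for _ in range(n_experts)
        )

    def forward(self, speech_h, visual_tokens):
        # Prepend projected visual tokens to the speech sequence.
        x = torch.cat([self.project(visual_tokens), speech_h], dim=1)
        gates = torch.softmax(self.router(x), dim=-1)            # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)
        return x + (expert_out * gates.unsqueeze(-2)).sum(-1)    # residual fusion

h = torch.randn(2, 100, 512)   # speech encoder states
v = torch.randn(2, 16, 768)    # visual tokens from a video encoder
print(VisualMoEFusion()(h, v).shape)  # torch.Size([2, 116, 512])
```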


Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

Dong, Lukuan, Qin, Donghong, Bai, Fengbo, Song, Fanhua, Liu, Yan, Xu, Chen, Ou, Zhijian

arXiv.org Artificial Intelligence

In our practice, it takes non-trivial effort to collect and transcribe even less than 10 hours of Iu Mien language. Mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data, so developing Iu Mien speech recognition systems is very challenging, while doing so is very important for reducing digital divides and for cultural inheritance. The paradigm of pre-training (PT) followed by fine-tuning (FT), called the PTFT paradigm, has emerged in recent years as an effective way to address the limited training data of low-resource languages in ASR: in pre-training, training data for a number of languages are merged to train a multilingual model, which can then serve as a backbone. Three such approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that its annotated speech is very limited. With less than 10 hours of transcribed Iu Mien speech, this paper investigates and compares the three approaches for Iu Mien speech recognition.
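As a concrete illustration of the PTFT paradigm, the sketch below loads a real multilingual self-supervised checkpoint from Hugging Face and attaches a fresh CTC head sized to a hypothetical Iu Mien phoneme inventory; the paper's actual backbone and output units differ:

```python
from transformers import Wav2Vec2ForCTC  # pip install transformers

# Multilingual self-supervised backbone used as a plausible stand-in.
# vocab_size=64 is a hypothetical Iu Mien phoneme inventory (plus blank);
# the new CTC head is randomly initialized and learned during fine-tuning.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=64,
    ctc_loss_reduction="mean",
)

# With <10 h of data, freeze the convolutional feature extractor so that
# only the transformer layers and the new CTC head are updated.
model.freeze_feature_encoder()
```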


Gated Low-rank Adaptation for personalized Code-Switching Automatic Speech Recognition on the low-spec devices

Kim, Gwantae, Lee, Bokyeung, Kim, Donghyeon, Ko, Hanseok

arXiv.org Artificial Intelligence

In recent times, there has been growing interest in running personalized large models on low-spec devices, such as mobile and CPU-only devices. However, using a personalized large model on-device is inefficient, and sometimes infeasible, due to computational cost. To tackle this problem, this paper presents a weight-separation method that minimizes on-device model weights using parameter-efficient fine-tuning. Moreover, some people speak multiple languages within a single utterance, known as code-switching, so a personalized ASR model must handle such cases. However, current multilingual speech recognition models are limited to recognizing a single language within each utterance. To address this, we propose code-switching speech recognition models that incorporate fine-tuned monolingual and multilingual speech recognition models. Additionally, we introduce gated low-rank adaptation (GLoRA) for parameter-efficient fine-tuning with minimal performance degradation. Our experiments on Korean-English code-switching datasets demonstrate that fine-tuning speech recognition models for code-switching surpasses the performance of traditional code-switching models trained from scratch. Furthermore, GLoRA improves parameter-efficient fine-tuning performance compared to conventional LoRA.
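A minimal sketch of what a gated low-rank adaptation layer could look like: a standard LoRA update scaled by a learned gate around a frozen linear layer. The scalar sigmoid gate is an assumption; the paper's exact gating formulation may differ:

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # only adapter weights train
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init
        self.scale = alpha / rank

    def forward(self, x):
        delta = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + torch.sigmoid(self.gate) * self.scale * delta

layer = GatedLoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```

Because lora_b starts at zero, the adapted layer initially matches the frozen base exactly; the gate then lets training modulate how strongly the low-rank update is applied.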


Mi-Go: Test Framework which uses YouTube as Data Source for Evaluating Speech Recognition Models like OpenAI's Whisper

Wojnar, Tomasz, Hryszko, Jaroslaw, Roman, Adam

arXiv.org Artificial Intelligence

This article introduces Mi-Go, a novel testing framework for evaluating the performance and adaptability of general-purpose speech recognition machine learning models across diverse real-world scenarios. The framework leverages YouTube as a rich and continuously updated data source, accounting for multiple languages, accents, dialects, speaking styles, and audio quality levels. To demonstrate the framework's effectiveness, the Whisper model, developed by OpenAI, was employed as a test object. The tests used a total of 124 YouTube videos, covering all Whisper model versions. The results underscore the utility of YouTube as a testing platform for speech recognition models, helping ensure their robustness, accuracy, and adaptability to diverse languages and acoustic conditions. Additionally, by contrasting machine-generated transcriptions against human-made subtitles, the Mi-Go framework can help pinpoint potential misuse of YouTube subtitles, such as Search Engine Optimization.
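The core loop of such a framework is short: fetch a video's audio, transcribe it, and score the output against the human subtitles. A sketch under stated assumptions (the URL is a placeholder, `load_subtitle_text` is a hypothetical helper, and the downloaded audio extension may vary by video):

```python
import whisper                      # pip install openai-whisper
import yt_dlp                       # pip install yt-dlp
from jiwer import wer               # pip install jiwer

URL = "https://www.youtube.com/watch?v=..."  # placeholder video URL

# Download the best available audio track for the video.
with yt_dlp.YoutubeDL({"format": "bestaudio", "outtmpl": "audio.%(ext)s"}) as ydl:
    ydl.download([URL])

# Transcribe with a Whisper checkpoint (assumes the file landed as audio.webm).
hypothesis = whisper.load_model("base").transcribe("audio.webm")["text"]

# Hypothetical helper that fetches the video's human-made subtitle text.
reference = load_subtitle_text(URL)
print(f"WER vs. human subtitles: {wer(reference, hypothesis):.2%}")
```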


Research on an improved Conformer end-to-end Speech Recognition Model with R-Drop Structure

Ji, Weidong, Zan, Shijie, Zhou, Guohui, Wang, Xu

arXiv.org Artificial Intelligence

To address the poor generalization of end-to-end speech recognition models in deep learning, this study proposes a new Conformer-based speech recognition model, "Conformer-R", that incorporates the R-Drop structure. By combining the Conformer, which has shown promising results in speech recognition, with R-Drop, the model can effectively capture both local and global speech information while reducing overfitting. This improves the model's generalization and overall recognition performance. The model was first pre-trained on the AISHELL-1 and WenetSpeech datasets for general domain adaptation, then fine-tuned on computer-related audio data. Comparison tests against classic models such as LAS and WeNet on the same test set demonstrate Conformer-R's ability to effectively improve generalization.
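The R-Drop structure itself is model-agnostic: the same batch is passed through the network twice, and since dropout is stochastic the two outputs differ; a symmetric KL-divergence term between the two output distributions is added to the task loss. A minimal sketch (`model` and the toy usage below are stand-ins for the Conformer and its training loss):

```python
import torch
import torch.nn.functional as F

def r_drop_loss(model, x, targets, ce_loss, kl_weight: float = 1.0):
    # Two stochastic forward passes: dropout yields two different outputs.
    logits1, logits2 = model(x), model(x)
    task = 0.5 * (ce_loss(logits1, targets) + ce_loss(logits2, targets))
    # Symmetric KL between the two predicted distributions.
    p, q = F.log_softmax(logits1, -1), F.log_softmax(logits2, -1)
    kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return task + kl_weight * kl

# Toy usage with a dropout-bearing classifier (in training mode by default).
model = torch.nn.Sequential(torch.nn.Dropout(0.1), torch.nn.Linear(10, 5))
x, y = torch.randn(8, 10), torch.randint(0, 5, (8,))
print(r_drop_loss(model, x, y, F.cross_entropy))
```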


Towards the Transferable Audio Adversarial Attack via Ensemble Methods

Guo, Feng, Sun, Zheng, Chen, Yuxuan, Ju, Lei

arXiv.org Artificial Intelligence

In recent years, deep learning (DL) models have achieved significant progress in many domains, such as autonomous driving, facial recognition, and speech recognition. However, the vulnerability of deep learning models to adversarial attacks has raised serious concerns because of their insufficient robustness and generalization, and transferable attacks have become a prominent method for black-box attacks. In this work, we explore the factors that affect the transferability of adversarial examples (AEs) in DL-based speech recognition. We also discuss the vulnerability of different DL systems and the irregular nature of their decision boundaries. Our results show a remarkable difference in AE transferability between speech and images, with data relevance being low for images but high for speech recognition. Motivated by dropout-based ensemble approaches, we propose random gradient ensembles and dynamic gradient-weighted ensembles, and we evaluate their impact on AE transferability. The results show that AEs created by both approaches transfer successfully to a black-box API.
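A random gradient ensemble in this spirit can be sketched as follows: at each attack step, gradients from a random subset of surrogate models are averaged before an FGSM-like update on the waveform. The surrogates, loss, and step sizes here are hypothetical, not the paper's exact setup:

```python
import torch

def ensemble_fgsm(models, loss_fn, audio, labels, eps=0.002, steps=10, k=2):
    adv = audio.clone().detach().requires_grad_(True)
    for _ in range(steps):
        # Random subset of surrogate models for this step's gradient estimate.
        idx = torch.randperm(len(models))[:k].tolist()
        loss = sum(loss_fn(models[i](adv), labels) for i in idx) / k
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv += eps * grad.sign()   # untargeted: push the loss upward
    return adv.detach()

# Toy surrogates: three small classifiers over 100-sample "audio" vectors.
models = [torch.nn.Linear(100, 10) for _ in range(3)]
audio, labels = torch.randn(4, 100), torch.randint(0, 10, (4,))
adv = ensemble_fgsm(models, torch.nn.functional.cross_entropy, audio, labels)
print((adv - audio).abs().max())  # perturbation bounded by steps * eps
```

A dynamic gradient-weighted variant would replace the uniform average with per-model weights updated across steps; the uniform average above is the simplest instance.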


Conformer-1: a robust speech recognition model

#artificialintelligence

The Conformer [1] is a neural net for speech recognition that was published by Google Brain in 2020. It builds upon the now-ubiquitous Transformer architecture [2], which is famous for its parallelizability and heavy use of the attention mechanism. By integrating convolutional layers into the Transformer architecture, the Conformer can capture both local and global dependencies while remaining a relatively size-efficient architecture. While the Conformer has shown state-of-the-art performance in speech recognition, its main downside is its computational and memory cost: the attention mechanism at its core, essential for capturing and retaining long-term information in an input sequence, is a well-known computational bottleneck.
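The idea of integrating convolution into a Transformer layer can be shown in a few lines: self-attention supplies global context and a depthwise convolution supplies local context, each added residually. This is a heavily simplified sketch; real Conformer blocks also include macaron feed-forward layers, relative positional encoding, and gating:

```python
import torch
import torch.nn as nn

class TinyConformerBlock(nn.Module):
    def __init__(self, d=256, heads=4, kernel=15):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        # Depthwise conv: one filter per channel, captures local patterns.
        self.conv = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)

    def forward(self, x):                                  # x: (batch, time, d)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global dependencies
        h = self.norm2(x).transpose(1, 2)                  # (batch, d, time)
        return x + self.conv(h).transpose(1, 2)            # local dependencies

print(TinyConformerBlock()(torch.randn(2, 50, 256)).shape)  # (2, 50, 256)
```

Note that the attention call is the quadratic-in-sequence-length step the article identifies as the bottleneck; the depthwise convolution costs only linear time in the sequence length.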