
Whisper Model


Arabic Little STT: Arabic Children Speech Recognition Dataset

Alkadri, Mouhand, Desouki, Dania, Jallad, Khloud Al

arXiv.org Artificial Intelligence

The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity, and the absence of child-specific speech corpora is a critical gap. To address it, we present Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6-13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, in stark contrast to its sub-0.20 WER on adult datasets. These results align with other research on English speech and highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development; such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step toward equitable speech technologies for Arabic-speaking children, and that our publicly available dataset enriches children's demographic representation in ASR datasets.
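
As a rough illustration of the evaluation protocol such a study implies (not the authors' released code), the sketch below transcribes a handful of utterances with several Whisper variants and scores corpus-level WER with the jiwer library; the file names, reference transcripts, and variant list are placeholders.

```python
# Minimal WER-evaluation sketch: transcribe each utterance with a given
# Whisper variant and score against references. Placeholder data only.
import whisper          # pip install openai-whisper
from jiwer import wer   # pip install jiwer

def evaluate_variant(model_name: str, samples: list[tuple[str, str]]) -> float:
    """Corpus-level WER of one Whisper variant over (audio_path, reference) pairs."""
    model = whisper.load_model(model_name)
    references, hypotheses = [], []
    for audio_path, reference in samples:
        result = model.transcribe(audio_path, language="ar")
        references.append(reference)
        hypotheses.append(result["text"])
    return wer(references, hypotheses)

samples = [("child_001.wav", "مرحبا"), ("child_002.wav", "كيف الحال")]  # placeholders
for name in ["tiny", "base", "small", "large-v3"]:  # subset of the eight variants
    print(name, evaluate_variant(name, samples))
```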


Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages

Mbonimpa, Pacome Simon, Tuyizere, Diane, Biyabani, Azizuddin Ahmed, Tonguz, Ozan K.

arXiv.org Artificial Intelligence

This paper presents a novel framework for speech transcription and synthesis, leveraging edge-cloud parallelism to enhance processing speed and accessibility for Kinyarwanda and Swahili speakers. It addresses the scarcity of powerful language processing tools for these widely spoken languages in East African countries with limited technological infrastructure. The framework utilizes the Whisper and SpeechT5 pre-trained models to enable speech-to-text (STT) and text-to-speech (TTS) translation. The architecture uses a cascading mechanism that distributes the model inference workload between the edge device and the cloud, reducing latency and resource usage at both ends. On the edge device, our approach achieves a memory usage compression of 9.5% for the SpeechT5 model and 14% for the Whisper model, with a maximum memory usage of 149 MB. Experimental results indicate that on a 1.7 GHz CPU edge device with 1 MB/s of network bandwidth, the system can process a 270-character text in less than a minute for both speech-to-text and text-to-speech transcription. Using real-world survey data from Kenya, we show that the proposed cascaded edge-cloud architecture could serve as an excellent platform for STT and TTS transcription with good accuracy and response time.

I. INTRODUCTION. In today's digital age, the need for accurate and efficient speech transcription and synthesis models has been increasing rapidly. These models play an important role in a variety of applications, such as learning new languages, accessibility tools for people with reading and hearing difficulties, and automated voice assistants [1]. Kinyarwanda and Swahili are two of the local languages spoken in East Africa. Swahili is the most widely spoken language in Eastern Africa, with speaker estimates ranging from 60 million to over 150 million [2].
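
The cascading idea can be pictured with a small dispatcher like the one below. This is an illustrative sketch, not the paper's implementation: the duration threshold, the cloud endpoint URL, and the edge model's `transcribe_bytes` method are all invented for the example.

```python
# Hypothetical edge-cloud cascade: short inputs run on the compressed
# on-device model; longer ones are offloaded to a cloud ASR service.
import requests

EDGE_MAX_SECONDS = 10.0                  # assumed on-device cutoff
CLOUD_URL = "http://cloud.example/asr"   # placeholder endpoint

def transcribe(audio_bytes: bytes, duration_s: float, edge_model) -> str:
    if duration_s <= EDGE_MAX_SECONDS:
        # Low latency, limited capacity: compressed Whisper on the edge.
        return edge_model.transcribe_bytes(audio_bytes)  # hypothetical method
    # Higher capacity at the cost of a network round trip.
    resp = requests.post(CLOUD_URL, data=audio_bytes, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"]
```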


ASR Under Noise: Exploring Robustness for Sundanese and Javanese

Pranida, Salsabila Zahirah, Airlangga, Muhammad Cendekia, Genadi, Rifo Ahmad, Shehata, Shady

arXiv.org Artificial Intelligence

We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, these models' effectiveness in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvement.
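
Synthetic noise augmentation of the kind mentioned here usually amounts to mixing a noise clip into clean speech at a target SNR; the sketch below shows one standard way to do that (the waveforms are random placeholders).

```python
# Mix noise into clean speech at a requested signal-to-noise ratio.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)       # loop/trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12    # avoid division by zero
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  solve for the noise gain.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Evaluate across a range of SNRs, as in the robustness setup above.
for snr_db in [20, 10, 5, 0]:
    noisy = mix_at_snr(np.random.randn(16000), np.random.randn(8000), snr_db)
```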


Automatic Speech Recognition for Greek Medical Dictation

Georgilas, Vardis, Stafylakis, Themos

arXiv.org Artificial Intelligence

Medical dictation systems are essential tools in modern healthcare, enabling accurate and efficient conversion of speech into written medical documentation. The main objective of this paper is to create a domain-specific system for Greek medical speech transcription. The ultimate goal is to assist healthcare professionals by reducing the burden of manual documentation and improving workflow efficiency. Towards this goal, we develop a system that combines automatic speech recognition techniques with a text correction model, allowing better handling of domain-specific terminology and linguistic variations in Greek. Our approach leverages both acoustic and textual modeling to produce more realistic and reliable transcriptions. We focus on adapting existing language and speech technologies to the Greek medical context, addressing challenges such as complex medical terminology and linguistic inconsistencies. Through domain-specific fine-tuning, our system achieves more accurate and coherent transcriptions, contributing to the development of practical language technologies for the Greek healthcare sector.
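
A two-stage ASR-plus-correction pipeline of this shape can be sketched with Hugging Face pipelines as below; the correction checkpoint name is hypothetical, and the authors' actual models and fine-tuning are not shown.

```python
# Two-stage sketch: ASR produces a draft, then a seq2seq model rewrites it.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
corrector = pipeline("text2text-generation",
                     model="my-org/greek-medical-corrector")  # hypothetical checkpoint

def transcribe_dictation(audio_path: str) -> str:
    draft = asr(audio_path)["text"]
    # The correction stage normalizes terminology and repairs ASR errors.
    return corrector(draft, max_new_tokens=256)[0]["generated_text"]
```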


Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning

Yang, Hongli, Peng, Yizhou, Huang, Hao, Li, Sheng

arXiv.org Artificial Intelligence

Large-scale multilingual ASR models like Whisper excel in high-resource settings but face challenges in low-resource scenarios, such as rare languages and code-switching (CS), due to computational costs and catastrophic forgetting. We explore Soft Prompt Tuning (SPT), a parameter-efficient method to enhance CS ASR while preserving prior knowledge. We evaluate two strategies: (1) full fine-tuning (FFT) of both soft prompts and the entire Whisper model, demonstrating improved cross-lingual capabilities compared to traditional methods, and (2) adhering to SPT's original design by freezing model parameters and only training soft prompts. Additionally, we introduce SPT4ASR, a combination of different SPT variants. Experiments on the SEAME and ASRU2019 datasets show that deep prompt tuning is the most effective SPT approach, and our SPT4ASR methods achieve further error reductions in CS ASR, maintaining parameter efficiency similar to LoRA, without degrading performance on existing languages.
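
The core mechanics of soft prompt tuning can be shown in a few lines: the backbone stays frozen and only a small matrix of prompt embeddings, prepended to the input embeddings, receives gradients. This is a generic sketch with illustrative shapes, not the paper's SPT4ASR implementation.

```python
# Soft prompt tuning in miniature: train only the prepended prompt matrix.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, prompt_len: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the same learned prompt to every sequence in the batch.
        prompt = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft_prompt = SoftPrompt(prompt_len=16, d_model=768)  # illustrative sizes
# Freeze the pretrained backbone; only prompt parameters get gradients:
#   for p in whisper_model.parameters(): p.requires_grad = False
optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
```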


SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data

Božík, Erik, Šuppa, Marek

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
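
The segmentation step described here, cutting long-form recordings into clean audio-transcript pairs of at most 30 seconds, can be sketched as below, assuming word-level timestamps from a forced aligner; the (word, start, end) format is an assumption, not the authors' schema.

```python
# Group aligned words into <=30 s chunks, yielding (start, end, text) triples
# that can be used to slice the audio and pair it with its transcript.
def segment(words: list[tuple[str, float, float]], max_len: float = 30.0):
    chunks, current, chunk_start = [], [], 0.0
    for word, start, end in words:
        if current and end - chunk_start > max_len:
            chunks.append((chunk_start, current[-1][2],
                           " ".join(w for w, _, _ in current)))
            current, chunk_start = [], start
        current.append((word, start, end))
    if current:
        chunks.append((chunk_start, current[-1][2],
                       " ".join(w for w, _, _ in current)))
    return chunks
```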


Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation

Liu, Changsong, Peng, Yizhou, Chng, Eng Siong

arXiv.org Artificial Intelligence

Contextual automatic speech recognition (ASR) systems allow for recognizing out-of-vocabulary (OOV) words, such as named entities or rare words. However, recognizing such words remains challenging due to limited training data and ambiguous or inconsistent pronunciations. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we leverage text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. Subsequently, any recognized variant is mapped back to the original rare word in the final transcription. The evaluation results on the LibriSpeech dataset show that our method reduces biased-word error rate (B-WER) by 43% on test-clean and 44% on test-other while maintaining unbiased WER (U-WER) essentially unchanged.
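
The prefix-trie reward mechanism can be illustrated with a minimal trie over token sequences: during beam search, a hypothesis whose next token extends a trie path would receive a log-probability bonus. This is a sketch of the general shallow-fusion idea, with an arbitrary reward value, not the paper's exact scoring.

```python
# Compile pronunciation-variant token sequences into a prefix trie and
# expose a per-token bonus for shallow fusion during beam search.
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}
        self.is_end = False        # marks a complete variant

def build_trie(variants: list[list[int]]) -> TrieNode:
    root = TrieNode()
    for tokens in variants:
        node = root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
        node.is_end = True
    return root

def fusion_bonus(node: TrieNode, token: int, reward: float = 2.0):
    """Return (next_node, bonus) for extending a beam with `token`."""
    nxt = node.children.get(token)
    return (nxt, reward) if nxt is not None else (None, 0.0)
```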


Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

King, Evan, Sabra, Adam, Kudlur, Manjunath, Wang, James, Warden, Pete

arXiv.org Artificial Intelligence

We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.
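
The "carefully balanced mix" of data sources could be realized with weighted sampling across pools, along the lines of the toy sketch below; the weights and pools are hypothetical, not the paper's recipe.

```python
# Toy weighted sampler over three training-data pools.
import itertools
import random

sources = {   # stand-ins for human-labeled, pseudo-labeled, synthetic pools
    "human": itertools.cycle(["h1", "h2"]),
    "pseudo": itertools.cycle(["p1", "p2"]),
    "synthetic": itertools.cycle(["s1", "s2"]),
}
weights = {"human": 0.5, "pseudo": 0.3, "synthetic": 0.2}  # hypothetical mix

def sample_example() -> str:
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    return next(sources[name])
```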


Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription

Antall, Abdul Rehman, Akhtar, Naveed

arXiv.org Artificial Intelligence

This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. Despite Urdu being the 10th most spoken language globally, with over 230 million speakers, its representation in automatic speech recognition (ASR) systems remains limited due to dialectal diversity, code-switching, and sparse training data. Results show that Whisper-Small achieves the lowest error rates (33.68%). Qualitative analysis reveals persistent challenges in phonetic accuracy and lexical coherence, particularly for complex utterances. While Whisper-Small demonstrates promise for deployable Urdu ASR, significant gaps remain. Our findings lay the groundwork for future research into effective, low-resource ASR systems.
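
A feasibility check like this one typically weighs accuracy against on-device cost; the sketch below times transcription with each lightweight variant (the audio file name is a placeholder).

```python
# Compare lightweight Whisper variants on latency for one Urdu sample.
import time
import whisper  # pip install openai-whisper

for size in ["tiny", "base", "small"]:
    model = whisper.load_model(size)
    t0 = time.perf_counter()
    text = model.transcribe("urdu_sample.wav", language="ur")["text"]  # placeholder file
    print(f"{size}: {time.perf_counter() - t0:.1f}s -> {text[:40]}")
```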


Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Lin, Jhen-Ke, Lu, Hao-Chien, Wang, Chung-Chun, Lin, Hong-Yun, Chen, Berlin

arXiv.org Artificial Intelligence

Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the "Extra" scheme yielded a 5.5% WER, an 11.3% relative improvement over the "Pure" scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.
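
Low-rank adaptation of Whisper, as used here, is commonly set up with the `peft` library; the sketch below uses typical rank and target-module choices, which are assumptions rather than the authors' exact configuration.

```python
# LoRA fine-tuning setup for Whisper: train small low-rank adapters
# injected into the attention projections, keeping the backbone frozen.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
config = LoraConfig(
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # a small fraction of all weights
```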