Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture

Singh, Karamvir

arXiv.org Artificial Intelligence

Modern automatic speech recognition systems have achieved remarkable performance through deep learning architectures, particularly models based on self-supervised learning paradigms. However, real-world deployment scenarios frequently involve challenging acoustic environments where background disturbances significantly compromise recognition accuracy. When processing audio containing substantial non-speech content, conventional systems often generate incoherent outputs, leading to elevated error rates that undermine practical utility. The fundamental challenge addressed in this work stems from the inability of standard ASR architectures to explicitly differentiate between meaningful speech signals and irrelevant acoustic interference. This limitation manifests as increased word error rates and character error rates when processing audio with poor signal-to-noise characteristics. This paper introduces an augmented architecture that extends the wav2vec2 model by incorporating a parallel noise detection pathway. Unlike conventional approaches that handle noise through preprocessing or post-processing stages, the proposed method integrates noise awareness directly into the feature learning process.
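The parallel-pathway idea above can be sketched in a few lines: a shared encoder produces frame-level features, a lightweight noise head scores each frame, and frames flagged as noise are suppressed before the recognition head sees them. The linear scorer, weights, and gating rule below are illustrative assumptions, not the paper's exact architecture.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def noise_gate(features, noise_head_w, threshold=0.5):
    """Zero out frames the noise head scores above `threshold`.

    features:     list of frame feature vectors (list[list[float]])
    noise_head_w: weights of a toy linear noise scorer (assumed)
    """
    gated = []
    for frame in features:
        score = sigmoid(sum(w * f for w, f in zip(noise_head_w, frame)))
        gated.append([0.0] * len(frame) if score > threshold else frame)
    return gated

# Toy frames: the middle one looks "noisy" to the scorer and is masked.
frames = [[-3.0, -1.0], [3.0, 4.0], [-0.1, 0.0]]
out = noise_gate(frames, noise_head_w=[1.0, 1.0])
```

In the integrated setting the paper describes, such a head would be trained jointly with the ASR objective rather than applied as a fixed pre/post-processing filter.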


Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Neural Information Processing Systems

Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.
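The probing methodology above can be illustrated compactly: freeze frame-level features from some layer of the pre-trained model, fit a trivial classifier on phone labels, and compare accuracy across layers. The nearest-centroid probe and toy one-dimensional "layer features" below are stand-ins for the paper's supervised classifier and real representations.

```python
def fit_centroids(feats, labels):
    """Mean feature vector per phone class."""
    sums, counts = {}, {}
    for f, y in zip(feats, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def probe_accuracy(feats, labels, centroids):
    """Frame accuracy of a nearest-centroid phone classifier."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    correct = sum(
        min(centroids, key=lambda c: dist2(f, centroids[c])) == y
        for f, y in zip(feats, labels)
    )
    return correct / len(feats)

# Toy features: "layer2" separates the two phones better than "layer0".
layers = {
    "layer0": [[0.1], [0.1], [0.3], [0.1]],
    "layer2": [[0.0], [1.0], [0.1], [0.9]],
}
phones = ["a", "b", "a", "b"]
accs = {
    name: probe_accuracy(f, phones, fit_centroids(f, phones))
    for name, f in layers.items()
}
```

Plotting such per-layer accuracies is exactly how layer depth and model design choices are compared in this style of analysis.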


A Deep Learning Automatic Speech Recognition Model for Shona Language

Sirora, Leslie Wellington, Mutandavari, Mainford

arXiv.org Artificial Intelligence

This study presented the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. The research aimed to address the challenges posed by limited training data, lack of labelled data, and the intricate tonal nuances present in Shona speech, with the objective of achieving significant improvements in recognition accuracy compared to traditional statistical models. The research first explored the feasibility of using deep learning to develop an accurate ASR system for Shona. Second, it investigated the specific challenges involved in designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate these challenges. Lastly, it compared the performance of the deep learning-based model with existing statistical models in terms of accuracy. The developed ASR system utilized a hybrid architecture consisting of a Convolutional Neural Network for acoustic modelling and a Long Short-Term Memory network for language modelling. To overcome the scarcity of data, data augmentation techniques and transfer learning were employed. Attention mechanisms were also incorporated to accommodate the tonal nature of Shona speech. The resulting ASR system achieved impressive results, with a Word Error Rate of 29%, Phoneme Error Rate of 12%, and an overall accuracy of 74%. These metrics indicated the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona. This study contributed to the advancement of ASR technology for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.
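The reported metrics (WER, PER) are edit-distance rates; for reference, a minimal word error rate computation looks like this (PER is the same dynamic program over phoneme sequences instead of words):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

score = wer("the cat sat", "the cat sat down")  # one insertion over 3 ref words
```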


Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic

Grigoryan, Lilit, Karpov, Nikolay, Albasiri, Enas, Lavrukhin, Vitaly, Ginsburg, Boris

arXiv.org Artificial Intelligence

Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language's complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), there is considerably less attention given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.


MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition

Xia, Yinfeng, Li, Huiyan, Le, Chenyang, Wang, Manhong, Sun, Yutao, Ma, Xingyang, Qian, Yanmin

arXiv.org Artificial Intelligence

Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition by fine-tuning Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, allowing each token to attend to the full left context and a finite right context of the speech sequence. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between latency and quality, making it suitable for various streaming applications.
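The attention pattern the abstract describes can be sketched as a boolean mask: each output token may attend to all past speech frames plus a fixed window of future frames beyond its aligned position. The explicit `align` list below is a stand-in for the alignment the CIF mechanism would produce; the shapes are toy assumptions.

```python
def mfla_mask(num_tokens, num_frames, align, lookahead):
    """mask[t][f] = True if token t may attend to frame f.

    align: monotonically non-decreasing list mapping each token to its
           aligned frame index (assumed; produced by CIF in the paper).
    lookahead: number of future frames visible past the aligned frame.
    """
    mask = []
    for t in range(num_tokens):
        limit = min(num_frames, align[t] + lookahead + 1)
        # unbounded left context, finite right context
        mask.append([f < limit for f in range(num_frames)])
    return mask

m = mfla_mask(num_tokens=3, num_frames=6, align=[1, 2, 4], lookahead=1)
```

A wait-k schedule then simply delays emitting token t until frame `align[t] + lookahead` has arrived, which is what keeps training and test-time decoding consistent.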


Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments

Djeffal, Noussaiba, Addou, Djamel, Kheddar, Hamza, Selouani, Sid Ahmed

arXiv.org Artificial Intelligence

Addressing the detrimental impact of non-stationary environmental noise on automatic speech recognition (ASR) has been a persistent and significant research focus. Despite advancements, this challenge continues to be a major concern. Recently, data-driven supervised approaches, such as deep neural networks, have emerged as promising alternatives to traditional unsupervised methods. With extensive training, these approaches have the potential to overcome the challenges posed by diverse real-life acoustic environments. In this light, this paper introduces a novel neural framework that incorporates a robust frontend into ASR systems for both clean and noisy environments. Using the Aurora-2 speech database, the authors evaluate the effectiveness of a Mel-frequency acoustic feature set, employing transfer learning based on a residual neural network (ResNet). The experimental results demonstrate a significant improvement in recognition accuracy compared to convolutional neural network (CNN) and long short-term memory (LSTM) baselines, achieving accuracies of 98.94% in clean conditions and 91.21% in noisy conditions.


Retrieval-Augmented Speech Recognition Approach for Domain Challenges

Shen, Peng, Lu, Xugang, Kawai, Hisashi

arXiv.org Artificial Intelligence

National Institute of Information and Communications Technology (NICT), Japan

Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts to the LLM decoder to improve speech recognition performance. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on CSJ, even without relying on the full training data. Automatic speech recognition (ASR) techniques have improved significantly due to advancements in system architecture and optimization algorithms [1]-[4].
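A minimal retrieval step in the spirit of the method above: rank domain-specific text snippets against a first-pass hypothesis and prepend the best matches to the decoder's prompt. The token-overlap scoring and prompt format here are illustrative assumptions, not the paper's actual retriever or template.

```python
def retrieve(hypothesis, corpus, k=2):
    """Return the k corpus snippets with the most word overlap (assumed scorer)."""
    hyp = set(hypothesis.split())
    scored = sorted(corpus, key=lambda s: len(hyp & set(s.split())), reverse=True)
    return scored[:k]

def build_prompt(hypothesis, corpus, k=2):
    """Prepend retrieved domain text as context for the LLM decoder (toy template)."""
    context = " ".join(retrieve(hypothesis, corpus, k))
    return f"Context: {context}\nTranscribe: {hypothesis}"

corpus = [
    "myocardial infarction treatment",
    "stock market report",
    "infarction risk factors",
]
prompt = build_prompt("myocardial infarction", corpus)
```

The point of doing this at inference time, as the abstract notes, is that no domain-specific text is needed during training; the model only has to learn to exploit whatever context appears in the prompt.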


DENOASR: Debiasing ASRs through Selective Denoising

Rai, Anand Kumar, Jaiswal, Siddharth D, Prakash, Shubham, Sree, Bendi Pragnya, Mukherjee, Animesh

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) systems have been examined and shown to exhibit biases toward particular groups of individuals, influenced by factors such as demographic traits, accents, and speech styles. Noise can disproportionately impact speakers with certain accents, dialects, or speaking styles, leading to biased error rates. In this work, we introduce a novel framework, DENOASR, a selective denoising technique that reduces the disparity in word error rates between the two gender groups, male and female. We find that a combination of two popular speech denoising techniques, viz. DEMUCS and LE, can be effectively used to mitigate ASR disparity without compromising overall performance. Experiments using two state-of-the-art open-source ASRs - OpenAI WHISPER and NVIDIA NEMO - on multiple benchmark datasets, including TIE, VOX-POPULI, TEDLIUM, and FLEURS, show a promising reduction in the average word error rate gap across the two gender groups. For a given dataset, denoising is selectively applied to speech samples whose speech intelligibility falls below a certain threshold, estimated using a small validation sample, thus obviating the need for large-scale human-written ground-truth transcripts. Our findings suggest that selective denoising can be an elegant approach to mitigating biases in present-day ASR systems.
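The selective rule described above is simple to state in code: denoise a sample only when its estimated intelligibility falls below a threshold tuned on a small validation set. `estimate_intelligibility` and `denoise` below are placeholders for the paper's components (an intelligibility estimator and the DEMUCS/LE denoisers); the toy stand-ins exist only to make the sketch runnable.

```python
def selective_denoise(samples, estimate_intelligibility, denoise, threshold):
    """Apply `denoise` only to samples scoring below `threshold`."""
    processed = []
    for audio in samples:
        score = estimate_intelligibility(audio)
        processed.append(denoise(audio) if score < threshold else audio)
    return processed

# Toy stand-ins: "intelligibility" = mean amplitude; "denoising" doubles it.
est = lambda audio: sum(audio) / len(audio)
den = lambda audio: [2 * x for x in audio]
out = selective_denoise([[0.2, 0.2], [0.8, 0.8]], est, den, threshold=0.5)
```

Because the threshold is picked from a small validation sample, the procedure needs no large-scale ground-truth transcripts, which is the practical appeal noted in the abstract.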


Automatic Speech Recognition for Biomedical Data in Bengali Language

Kabir, Shariar, Nahar, Nazmun, Saha, Shyamasree, Rashid, Mamunur

arXiv.org Artificial Intelligence

Recent advancements in domain-specific Automated Speech Recognition (ASR) and Large Language Models (LLMs) have significantly boosted the adoption of AI in digital services across many industries, such as financial services and healthcare. In the healthcare industry in particular, the integration of AI-driven solutions such as conversational chatbots and voice-interactive guidance is opening new avenues to engage patients and healthcare providers ([1], [2]). Many healthcare systems in the developed world have been adopting these systems to increase patient satisfaction. One key shortcoming is that the majority of developments in this domain focus on patients of European descent and their medical vocabularies. Many non-European languages, though spoken by millions, have seen very limited advancements. Bengali, despite being the seventh most spoken language with 270 million speakers worldwide, has seen very limited progress in NLP and ASR research. This has hindered the integration of these technologies into digital health services for Bengali speakers, which in turn has slowed the adoption of digital health solutions. While many European-language speakers benefit from AI-driven services (such as conversational chatbot assistance) like digital appointment booking, symptom reporting before appointments, and mental health support, Bengali speakers are not able to benefit from these advancements. Bengali ASR research has seen a significant surge in recent years, fueled by the release of large public speech corpora like Google's "Large Bengali ASR training data" (LB-ASRTD).


The evaluation of a code-switched Sepedi-English automatic speech recognition system

Phaladi, Amanda, Modipa, Thipe

arXiv.org Artificial Intelligence

Speech technology is a field that encompasses various techniques and tools used to enable machines to interact with speech, such as automatic speech recognition (ASR) and spoken dialog systems, allowing a device to capture spoken words from a human speaker through a microphone. End-to-end approaches such as Connectionist Temporal Classification (CTC) and attention-based methods are the most widely used for the development of ASR systems. However, these techniques have mainly been applied to high-resourced languages with large amounts of speech data for training and evaluation, leaving low-resource languages relatively underdeveloped. While the CTC method has been successfully used for other languages, its effectiveness for the Sepedi language remains uncertain. In this study, we present the evaluation of a Sepedi-English code-switched automatic speech recognition system. This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach. The performance of the system was evaluated using both the NCHLT Sepedi test corpus and the Sepedi Prompted Code Switching corpus. The model achieved its lowest WER of 41.9%; however, it faced challenges in recognizing Sepedi-only text.