Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture

Singh, Karamvir

arXiv.org Artificial Intelligence

Modern automatic speech recognition systems have achieved remarkable performance through deep learning architectures, particularly models based on self-supervised learning paradigms. However, real-world deployment scenarios frequently involve challenging acoustic environments where background disturbances significantly compromise recognition accuracy. When processing audio containing substantial non-speech content, conventional systems often generate incoherent outputs, leading to elevated error rates that undermine practical utility. The fundamental challenge addressed in this work stems from the inability of standard ASR architectures to explicitly differentiate between meaningful speech signals and irrelevant acoustic interference. This limitation manifests as increased word error rates and character error rates when processing audio with poor signal-to-noise characteristics. This paper introduces an augmented architecture that extends the wav2vec2 model by incorporating a parallel noise detection pathway. Unlike conventional approaches that handle noise through preprocessing or post-processing stages, the proposed method integrates noise awareness directly into the feature learning process.
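The parallel-pathway idea above can be sketched in a few lines: a shared encoder produces frame-level features, a lightweight noise head scores each frame, and frames flagged as noise are suppressed before the recognition head sees them. The linear scorer, weights, and gating rule below are illustrative assumptions, not the paper's exact architecture.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def noise_gate(features, noise_head_w, threshold=0.5):
    """Zero out frames the noise head scores above `threshold`.

    features:     list of frame feature vectors (list[list[float]])
    noise_head_w: weights of a toy linear noise scorer (assumed)
    """
    gated = []
    for frame in features:
        score = sigmoid(sum(w * f for w, f in zip(noise_head_w, frame)))
        gated.append([0.0] * len(frame) if score > threshold else frame)
    return gated

# Toy frames: the middle one looks "noisy" to the scorer and is masked.
frames = [[-3.0, -1.0], [3.0, 4.0], [-0.1, 0.0]]
out = noise_gate(frames, noise_head_w=[1.0, 1.0])
```

In the integrated setting the paper describes, such a head would be trained jointly with the ASR objective rather than applied as a fixed pre/post-processing filter.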


Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Neural Information Processing Systems

Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.
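The probing methodology above can be illustrated compactly: freeze frame-level features from some layer of the pre-trained model, fit a trivial classifier on phone labels, and compare accuracy across layers. The nearest-centroid probe and toy one-dimensional "layer features" below are stand-ins for the paper's supervised classifier and real representations.

```python
def fit_centroids(feats, labels):
    """Mean feature vector per phone class."""
    sums, counts = {}, {}
    for f, y in zip(feats, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def probe_accuracy(feats, labels, centroids):
    """Frame accuracy of a nearest-centroid phone classifier."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    correct = sum(
        min(centroids, key=lambda c: dist2(f, centroids[c])) == y
        for f, y in zip(feats, labels)
    )
    return correct / len(feats)

# Toy features: "layer2" separates the two phones better than "layer0".
layers = {
    "layer0": [[0.1], [0.1], [0.3], [0.1]],
    "layer2": [[0.0], [1.0], [0.1], [0.9]],
}
phones = ["a", "b", "a", "b"]
accs = {
    name: probe_accuracy(f, phones, fit_centroids(f, phones))
    for name, f in layers.items()
}
```

Plotting such per-layer accuracies is exactly how layer depth and model design choices are compared in this style of analysis.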


A Deep Learning Automatic Speech Recognition Model for Shona Language

Sirora, Leslie Wellington, Mutandavari, Mainford

arXiv.org Artificial Intelligence

This study presented the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. The research aimed to address the challenges posed by limited training data, lack of labelled data, and the intricate tonal nuances present in Shona speech, with the objective of achieving significant improvements in recognition accuracy compared to traditional statistical models. The research first explored the feasibility of using deep learning to develop an accurate ASR system for Shona. Second, it investigated the specific challenges involved in designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate these challenges. Lastly, it compared the performance of the deep learning-based model with existing statistical models in terms of accuracy. The developed ASR system utilized a hybrid architecture consisting of a Convolutional Neural Network for acoustic modelling and a Long Short-Term Memory network for language modelling. To overcome the scarcity of data, data augmentation techniques and transfer learning were employed. Attention mechanisms were also incorporated to accommodate the tonal nature of Shona speech. The resulting ASR system achieved impressive results, with a Word Error Rate of 29%, Phoneme Error Rate of 12%, and an overall accuracy of 74%. These metrics indicated the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona. This study contributed to the advancement of ASR technology for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.
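The reported metrics (WER, PER) are edit-distance rates; for reference, a minimal word error rate computation looks like this (PER is the same dynamic program over phoneme sequences instead of words):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

score = wer("the cat sat", "the cat sat down")  # one insertion over 3 ref words
```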


Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic

Grigoryan, Lilit, Karpov, Nikolay, Albasiri, Enas, Lavrukhin, Vitaly, Ginsburg, Boris

arXiv.org Artificial Intelligence

Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language's complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), there is considerably less attention given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.


MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition

Xia, Yinfeng, Li, Huiyan, Le, Chenyang, Wang, Manhong, Sun, Yutao, Ma, Xingyang, Qian, Yanmin

arXiv.org Artificial Intelligence

Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition by fine-tuning Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, allowing each token to attend to the full left context and a finite right context of the speech sequence. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between latency and quality, making it suitable for various streaming applications.
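The attention pattern the abstract describes can be sketched as a boolean mask: each output token may attend to all past speech frames plus a fixed window of future frames beyond its aligned position. The explicit `align` list below is a stand-in for the alignment the CIF mechanism would produce; the shapes are toy assumptions.

```python
def mfla_mask(num_tokens, num_frames, align, lookahead):
    """mask[t][f] = True if token t may attend to frame f.

    align: monotonically non-decreasing list mapping each token to its
           aligned frame index (assumed; produced by CIF in the paper).
    lookahead: number of future frames visible past the aligned frame.
    """
    mask = []
    for t in range(num_tokens):
        limit = min(num_frames, align[t] + lookahead + 1)
        # unbounded left context, finite right context
        mask.append([f < limit for f in range(num_frames)])
    return mask

m = mfla_mask(num_tokens=3, num_frames=6, align=[1, 2, 4], lookahead=1)
```

A wait-k schedule then simply delays emitting token t until frame `align[t] + lookahead` has arrived, which is what keeps training and test-time decoding consistent.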


Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments

Djeffal, Noussaiba, Addou, Djamel, Kheddar, Hamza, Selouani, Sid Ahmed

arXiv.org Artificial Intelligence

Addressing the detrimental impact of non-stationary environmental noise on automatic speech recognition (ASR) has been a persistent and significant research focus. Despite advancements, this challenge continues to be a major concern. Recently, data-driven supervised approaches, such as deep neural networks, have emerged as promising alternatives to traditional unsupervised methods. With extensive training, these approaches have the potential to overcome the challenges posed by diverse real-life acoustic environments. In this light, this paper introduces a novel neural framework that incorporates a robust frontend into ASR systems for both clean and noisy environments. Using the Aurora-2 speech database, the authors evaluate the effectiveness of a Mel-frequency acoustic feature set, employing transfer learning based on a residual neural network (ResNet). The experimental results demonstrate a significant improvement in recognition accuracy compared to convolutional neural network (CNN) and long short-term memory (LSTM) baselines, achieving accuracies of 98.94% in clean conditions and 91.21% in noisy conditions.


Retrieval-Augmented Speech Recognition Approach for Domain Challenges

Shen, Peng, Lu, Xugang, Kawai, Hisashi

arXiv.org Artificial Intelligence

National Institute of Information and Communications Technology (NICT), Japan

Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts to the LLM decoder to improve speech recognition performance. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on CSJ, even without relying on the full training data. Automatic speech recognition (ASR) techniques have improved significantly due to advancements in system architecture and optimization algorithms [1]-[4].
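A minimal retrieval step in the spirit of the method above: rank domain-specific text snippets against a first-pass hypothesis and prepend the best matches to the decoder's prompt. The token-overlap scoring and prompt format here are illustrative assumptions, not the paper's actual retriever or template.

```python
def retrieve(hypothesis, corpus, k=2):
    """Return the k corpus snippets with the most word overlap (assumed scorer)."""
    hyp = set(hypothesis.split())
    scored = sorted(corpus, key=lambda s: len(hyp & set(s.split())), reverse=True)
    return scored[:k]

def build_prompt(hypothesis, corpus, k=2):
    """Prepend retrieved domain text as context for the LLM decoder (toy template)."""
    context = " ".join(retrieve(hypothesis, corpus, k))
    return f"Context: {context}\nTranscribe: {hypothesis}"

corpus = [
    "myocardial infarction treatment",
    "stock market report",
    "infarction risk factors",
]
prompt = build_prompt("myocardial infarction", corpus)
```

The point of doing this at inference time, as the abstract notes, is that no domain-specific text is needed during training; the model only has to learn to exploit whatever context appears in the prompt.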


DENOASR: Debiasing ASRs through Selective Denoising

Rai, Anand Kumar, Jaiswal, Siddharth D, Prakash, Shubham, Sree, Bendi Pragnya, Mukherjee, Animesh

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) systems have been examined and shown to exhibit biases toward particular groups of individuals, influenced by factors such as demographic traits, accents, and speech styles. Noise can disproportionately impact speakers with certain accents, dialects, or speaking styles, leading to biased error rates. In this work, we introduce a novel framework, DENOASR, a selective denoising technique that reduces the disparity in word error rates between the two gender groups, male and female. We find that a combination of two popular speech denoising techniques, viz. DEMUCS and LE, can be effectively used to mitigate ASR disparity without compromising overall performance. Experiments using two state-of-the-art open-source ASRs - OpenAI WHISPER and NVIDIA NEMO - on multiple benchmark datasets, including TIE, VOX-POPULI, TEDLIUM, and FLEURS, show a promising reduction in the average word error rate gap across the two gender groups. For a given dataset, denoising is selectively applied to speech samples whose speech intelligibility falls below a certain threshold, estimated using a small validation sample, thus obviating the need for large-scale human-written ground-truth transcripts. Our findings suggest that selective denoising can be an elegant approach to mitigating biases in present-day ASR systems.
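The selective rule described above is simple to state in code: denoise a sample only when its estimated intelligibility falls below a threshold tuned on a small validation set. `estimate_intelligibility` and `denoise` below are placeholders for the paper's components (an intelligibility estimator and the DEMUCS/LE denoisers); the toy stand-ins exist only to make the sketch runnable.

```python
def selective_denoise(samples, estimate_intelligibility, denoise, threshold):
    """Apply `denoise` only to samples scoring below `threshold`."""
    processed = []
    for audio in samples:
        score = estimate_intelligibility(audio)
        processed.append(denoise(audio) if score < threshold else audio)
    return processed

# Toy stand-ins: "intelligibility" = mean amplitude; "denoising" doubles it.
est = lambda audio: sum(audio) / len(audio)
den = lambda audio: [2 * x for x in audio]
out = selective_denoise([[0.2, 0.2], [0.8, 0.8]], est, den, threshold=0.5)
```

Because the threshold is picked from a small validation sample, the procedure needs no large-scale ground-truth transcripts, which is the practical appeal noted in the abstract.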


Automatic Speech Recognition for Biomedical Data in Bengali Language

Kabir, Shariar, Nahar, Nazmun, Saha, Shyamasree, Rashid, Mamunur

arXiv.org Artificial Intelligence

Recent advancements in domain-specific Automated Speech Recognition (ASR) and Large Language Models (LLMs) have significantly boosted the adoption of AI in digital services across many industries, such as financial services and healthcare. In the healthcare industry in particular, the integration of AI-driven solutions such as conversational chatbots and voice-interactive guidance is opening new avenues to engage patients and healthcare providers ([1], [2]). Many healthcare systems in the developed world have been adopting these systems to increase patient satisfaction. One key shortcoming is that the majority of developments in this domain focus on patients of European descent and their medical vocabularies. Many non-European languages, though spoken by millions, have seen very limited advancements. Bengali, despite being the seventh most spoken language with 270 million speakers worldwide, has seen very limited progress in NLP and ASR research. This has hindered the integration of these technologies into digital health services for Bengali speakers, which in turn has slowed the adoption of digital health solutions. While many European-language speakers benefit from AI-driven services (such as conversational chatbot assistance) like digital appointment booking, symptom reporting before appointments, and mental health support, Bengali speakers are not able to benefit from these advancements. Bengali ASR research has seen a significant surge in recent years, fueled by the release of large public speech corpora like Google's "Large Bengali ASR training data" (LB-ASRTD).


The evaluation of a code-switched Sepedi-English automatic speech recognition system

Phaladi, Amanda, Modipa, Thipe

arXiv.org Artificial Intelligence

Speech technology is a field that encompasses various techniques and tools used to enable machines to interact with speech, such as automatic speech recognition (ASR) and spoken dialog systems, allowing a device to capture spoken words from a human speaker through a microphone. End-to-end approaches such as Connectionist Temporal Classification (CTC) and attention-based methods are the most widely used for the development of ASR systems. However, these techniques have mainly been applied to high-resourced languages with large amounts of speech data for training and evaluation, leaving low-resource languages relatively underdeveloped. While the CTC method has been successfully used for other languages, its effectiveness for the Sepedi language remains uncertain. In this study, we present the evaluation of a Sepedi-English code-switched automatic speech recognition system. This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach. The performance of the system was evaluated using both the NCHLT Sepedi test corpus and the Sepedi Prompted Code Switching corpus. The model achieved its lowest WER of 41.9%; however, it faced challenges in recognizing Sepedi-only text.