DeepSpeech


Automatic Speech Recognition for Biomedical Data in Bengali Language

Kabir, Shariar, Nahar, Nazmun, Saha, Shyamasree, Rashid, Mamunur

arXiv.org Artificial Intelligence

Recent advancements in domain-specific Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly boosted the adoption of AI in digital services across many industries, such as financial services and healthcare. In the healthcare industry in particular, the integration of AI-driven solutions such as conversational chatbots and voice-interactive guidance is opening new avenues to engage patients and healthcare providers ([1], [2]). Many healthcare systems in the developed world have adopted these systems to increase patient satisfaction. One key shortcoming is that the majority of developments in this domain are focused on patients of European descent and their medical vocabularies. Many non-European languages, though spoken by millions, have seen very limited advancements. Bengali, despite being the seventh most spoken language with 270 million speakers worldwide, has seen very limited progress in NLP and ASR research. This has hindered the integration of these technologies into digital health services for Bengali speakers, which in turn has slowed the adoption of digital health solutions. While speakers of many European languages benefit from AI-driven, chatbot-assisted services such as digital appointment booking, pre-appointment symptom reporting, and mental health support, Bengali speakers are not able to benefit from these advancements. Bengali ASR research has nevertheless seen a significant surge in recent years, fueled by the release of large public speech corpora such as Google's "Large Bengali ASR training data" (LB-ASRTD).


A Novel Scheme to classify Read and Spontaneous Speech

Kopparapu, Sunil Kumar

arXiv.org Artificial Intelligence

The COVID-19 pandemic has led to an increased use of remote telephonic interviews, making it important to distinguish between scripted and spontaneous speech in audio recordings. In this paper, we propose a novel scheme for identifying read and spontaneous speech. Our approach uses a pre-trained DeepSpeech audio-to-alphabet recognition engine to generate a sequence of alphabets from the audio. From these alphabets, we derive features that allow us to discriminate between read and spontaneous speech. Our experimental results show that even a small set of self-explanatory features can classify the two types of speech very effectively.
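The abstract does not list the paper's actual features, but the general idea of deriving statistics from a decoded character sequence can be sketched as follows. The two features below (repetition fraction and character-distribution entropy) are purely hypothetical illustrations, not the paper's feature set:

```python
from collections import Counter
import math

def char_features(text):
    """Hypothetical features over a decoded character sequence:
    fraction of immediately repeated characters, and the Shannon
    entropy of the character distribution (in bits)."""
    if not text:
        return {"repeat_frac": 0.0, "entropy": 0.0}
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    counts = Counter(text)
    n = len(text)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"repeat_frac": repeats / max(1, len(text) - 1),
            "entropy": entropy}
```

A downstream classifier would then be trained on vectors of such features rather than on the raw audio.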


DeepSpeech for Dummies - A Tutorial and Overview

#artificialintelligence

DeepSpeech is a neural network architecture first published by a research team at Baidu. In 2017, Mozilla created an open source implementation of this paper, dubbed "Mozilla DeepSpeech". The original DeepSpeech paper from Baidu popularized the concept of "end-to-end" speech recognition models. "End-to-end" means that the model takes in audio and directly outputs characters or words. This contrasts with traditional speech recognition models, like those built with popular open source libraries such as Kaldi or CMU Sphinx, that predict phonemes and then convert those phonemes to words in a later, downstream process. The goal of "end-to-end" models, like DeepSpeech, was to simplify the speech recognition pipeline into a single model. In addition, the theory introduced by the Baidu research paper was that training large deep learning models, on large amounts of data, would yield better performance than classical speech recognition models.
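The character-level output described above is typically produced by CTC decoding: the network emits a symbol (or a special blank) per audio frame, and the decoder collapses repeats and drops blanks. A minimal sketch of greedy CTC decoding in Python, with an illustrative alphabet that is not DeepSpeech's actual configuration:

```python
def ctc_greedy_decode(frame_indices, alphabet, blank=0):
    """Greedy CTC decode: given the per-frame argmax symbol indices,
    collapse consecutive repeats and drop the blank symbol."""
    decoded = []
    prev = None
    for idx in frame_indices:
        if idx != prev and idx != blank:
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)

# Illustrative alphabet: index 0 is the CTC blank.
alphabet = ["_", "c", "a", "t", " "]
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax indices
print(ctc_greedy_decode(frames, alphabet))  # -> "cat"
```

Full systems replace the greedy argmax with a beam search, often scored by a language model, but the collapse-and-drop rule is the same.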


Effects of Layer Freezing on Transferring a Speech Recognition System to Under-resourced Languages

Eberhard, Onno, Zesch, Torsten

arXiv.org Artificial Intelligence

In this paper, we investigate the effect of layer freezing on the effectiveness of model transfer in the area of automatic speech recognition. We experiment with Mozilla's DeepSpeech architecture on German and Swiss German speech datasets and compare the results of training from scratch with those of transferring a pre-trained model. We compare different layer freezing schemes and find that freezing even a single layer already significantly improves results.


The real cost of cloud computing - VentureBeat - UrIoTNews

#artificialintelligence

The public cloud is growing rapidly, and the market for the technology is expected to reach $1.3 trillion by 2025. The cloud has revolutionized the computing industry and enabled many applications, business models and enterprises, which otherwise wouldn't have been possible. Immediate availability, scalability, minimal capital expenditure and a streamlined developer experience are its main advantages -- but they come at a cost. Due to a lack of in-house infrastructure optimization capabilities, most enterprises stick with the cloud even after reaching a certain maturity. To keep cloud spending under control, enterprises have built or acquired tools and services.


Visualizing Automatic Speech Recognition -- Means for a Better Understanding?

Markert, Karla, Parracone, Romain, Kulakov, Mykhailo, Sperl, Philip, Kao, Ching-Yu, Böttinger, Konstantin

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) is improving ever more at mimicking human speech processing. The functioning of ASR systems, however, remains to a large extent obfuscated by the complex structure of the deep neural networks (DNNs) they are based on. In this paper, we show how so-called attribution methods, which we import from image recognition and suitably adapt to handle audio data, can help to clarify the working of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as the detection of adversarial examples.
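Of the three techniques, saliency maps are the simplest: importance is attributed via the gradient of the model's output score with respect to each input feature. A generic numerical-gradient sketch on a toy scoring function (not DeepSpeech itself, where the gradient would come from backpropagation):

```python
def saliency(score_fn, x, eps=1e-5):
    """Approximate |d score / d x_i| for each input dimension
    by central finite differences."""
    sal = []
    for i in range(len(x)):
        hi = list(x); hi[i] += eps
        lo = list(x); lo[i] -= eps
        sal.append(abs(score_fn(hi) - score_fn(lo)) / (2 * eps))
    return sal

# Toy "model": the score depends strongly on x[0], weakly on x[1],
# so the saliency map should highlight x[0].
score = lambda x: 5.0 * x[0] + 0.1 * x[1]
print(saliency(score, [0.2, 0.7]))  # roughly [5.0, 0.1]
```

For audio, `x` would be the waveform samples or spectrogram bins, and the resulting map can be plotted over time to show which segments drove a transcription.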


Automatic Speaker Independent Dysarthric Speech Intelligibility Assessment System

Tripathi, Ayush, Bhosale, Swapnil, Kopparapu, Sunil Kumar

arXiv.org Artificial Intelligence

Dysarthria is a condition which hampers the ability of an individual to control the muscles that play a major role in speech delivery. The loss of fine control over the muscles that assist the movement of the lips, vocal cords, tongue and diaphragm results in abnormal speech delivery. One can assess the severity level of dysarthria by analyzing the intelligibility of speech spoken by an individual. Continuous intelligibility assessment not only helps speech language pathologists study the impact of medication but also allows them to plan personalized therapy. It helps clinicians immensely if the intelligibility assessment system is reliable, automatic, and simple for (a) patients to undergo and (b) clinicians to interpret. The lack of available dysarthric data has resulted in the development of speaker-dependent automatic intelligibility assessment systems, which require patients to speak a large number of utterances. In this paper, we propose (a) a cost minimization procedure to select an optimal (small) number of utterances that need to be spoken by the dysarthric patient, (b) four different speaker-independent intelligibility assessment systems which require the patient to speak a small number of words, and (c) an assessment score that is close to the perceptual score the Speech Language Pathologist (SLP) can relate to. The small number of utterances required of the patient, and a score the SLP can relate to, benefit both the dysarthric patient and the clinician from a usability perspective.


Audio Adversarial Examples: Attacks Using Vocal Masks

Tay, Kai Yuan, Ng, Lynnette, Chua, Wei Han, Loke, Lucerne, Ye, Danqi, Chua, Melissa

arXiv.org Artificial Intelligence

We construct audio adversarial examples on automatic Speech-To-Text systems. Given any audio waveform, we produce another by overlaying an audio vocal mask generated from the original audio. We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx. In addition, we engaged human annotators to transcribe the adversarial audio. Our experiments show that these adversarial examples fool State-Of-The-Art Speech-To-Text systems, yet humans are able to consistently pick out the speech. The feasibility of this attack introduces a new domain to study machine and human perception of speech.
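The overlay step itself is plain signal addition; a minimal sketch, with the understanding that generating the vocal mask is the paper's actual contribution and is not reproduced here, and that the mixing weight `alpha` is an illustrative placeholder:

```python
def overlay(audio, mask, alpha=0.1):
    """Add a scaled perturbation (mask) to a waveform whose samples
    are normalized to [-1, 1], clipping the result to stay in range."""
    return [max(-1.0, min(1.0, a + alpha * m))
            for a, m in zip(audio, mask)]

clean = [0.0, 0.5, -0.5, 0.99]
mask = [1.0, 1.0, -1.0, 1.0]   # placeholder; the paper derives this
adversarial = overlay(clean, mask)
```

In practice both signals would be NumPy arrays read from WAV files, and the perturbed waveform would be written back out and fed to each STT system.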


DeepSpeech 0.6: Mozilla's Speech-to-Text Engine Gets Fast, Lean, and Ubiquitous – Mozilla Hacks - the Web developer blog

#artificialintelligence

The Machine Learning team at Mozilla continues work on DeepSpeech, an automatic speech recognition (ASR) engine which aims to make speech recognition technology and trained models openly available to developers. DeepSpeech is a deep learning-based ASR engine with a simple API. We also provide pre-trained English models. Our latest release, version v0.6, offers the highest quality, most feature-packed model so far. In this overview, we'll show how DeepSpeech can transform your applications by enabling client-side, low-latency, and privacy-preserving speech recognition capabilities.
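The API is a thin wrapper: load a model, feed it 16 kHz 16-bit PCM samples, get text back. A minimal sketch of the `deepspeech` Python package's inference path; constructor arguments and scorer handling changed between releases, so treat the details as indicative rather than exact:

```python
def transcribe(model_path, audio_path):
    """Transcribe a 16 kHz mono 16-bit WAV file with the `deepspeech`
    package. Imports are done lazily so the sketch can be read without
    the package installed."""
    import wave
    import numpy as np
    import deepspeech  # pip install deepspeech

    model = deepspeech.Model(model_path)
    with wave.open(audio_path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)
    return model.stt(audio)
```

Because inference runs locally, the audio never leaves the device, which is what enables the client-side, privacy-preserving use cases mentioned above.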


Mozilla updates DeepSpeech with an English language model that runs 'faster than real time'

#artificialintelligence

DeepSpeech, a speech-to-text engine maintained by Mozilla's Machine Learning Group, this morning received an update (to version 0.6) that incorporates one of the fastest open source speech recognition models to date. In a blog post, senior research engineer Reuben Morais lays out what's new and enhanced, as well as other spotlight features coming down the pipeline. The latest version of DeepSpeech adds support for TensorFlow Lite, a version of Google's TensorFlow machine learning framework that's optimized for compute-constrained mobile and embedded devices. This has reduced DeepSpeech's package size from 98MB to 3.7MB, and its built-in English model size -- which has a 7.5% word error rate on a popular benchmark and which was trained on 5,516 hours of transcribed audio from WAMU (NPR), LibriSpeech, Fisher, Switchboard, and Mozilla's Common Voice English data sets -- from 188MB to 47MB. Plus, it has cut DeepSpeech's memory consumption by a factor of 22 and made startup over 500 times faster.