Speech Recognition: Overviews


Universal adversarial examples in speech command classification

arXiv.org Machine Learning

Adversarial examples are inputs intentionally perturbed with the aim of forcing a machine learning model to produce a wrong prediction, while the changes are not easily detectable by a human. Although this topic has been intensively studied in the image domain, classification tasks in the audio domain have received less attention. In this paper we address the existence of universal perturbations for speech command classification. We provide evidence that universal attacks can be generated for speech command classification tasks, and that they generalize across different models to a significant extent. Additionally, a novel analytical framework is proposed for the evaluation of universal perturbations under different levels of universality, demonstrating that the feasibility of generating effective perturbations decreases as the universality level increases. Finally, we propose a more detailed and rigorous framework to measure the amount of distortion introduced by the perturbations, demonstrating that the conventionally employed measures are not realistic for audio-based problems.
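
The abstract above describes a single, input-agnostic perturbation that degrades a classifier on many clips. Below is a minimal PyTorch sketch of that general idea (iteratively accumulating one waveform perturbation under an L-infinity bound); the `model`, the clip list, and the bound `eps` are hypothetical placeholders, not the authors' exact procedure or distortion measure.

```python
import torch
import torch.nn.functional as F

def universal_perturbation(model, clips, labels, eps=0.01, step=0.001, epochs=5):
    """Accumulate one waveform perturbation that degrades accuracy on many clips."""
    model.eval()
    delta = torch.zeros_like(clips[0], requires_grad=True)
    for _ in range(epochs):
        for x, y in zip(clips, labels):
            logits = model((x + delta).unsqueeze(0))
            if logits.argmax(dim=1).item() != y:
                continue                              # already misclassified: skip
            loss = F.cross_entropy(logits, torch.tensor([y]))
            loss.backward()
            with torch.no_grad():
                delta += step * delta.grad.sign()     # gradient-ascent step on the loss
                delta.clamp_(-eps, eps)               # project onto the L_inf ball
            delta.grad.zero_()
    return delta.detach()
```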


Artificial Intelligence and Machine Learning Drive The Future

#artificialintelligence

One of the disruptive technologies that has gained increasing attention since the turn of the century is Machine Learning. Machine Learning – closely related to and usually considered a subfield of Artificial Intelligence (AI) – is the process of automatically detecting usable patterns within data. The detection of these patterns is performed with the help of machine learning algorithms that are specifically tailored to deal with complex and large data sets. Such powerful algorithms have the potential to drastically change how businesses operate. In this article I provide an overview of the opportunities that machine learning algorithms and Artificial Intelligence (AI) offer to the business environment.


More Than Half of Consumers Want to Use Voice Assistants for Healthcare - The Ritz Herald

#artificialintelligence

Orbita, Inc., provider of healthcare's most powerful conversational AI platform, today announced the release of the Voice Assistant Consumer Adoption Report for Healthcare 2019. To develop the report, Orbita sponsored independent research by Voicebot.ai based on a survey of 1,004 U.S. adults. The 40-page report includes 20 charts, ten case studies highlighting today's real-world voice-powered healthcare solutions, and 35 pages of analysis. It is available at no cost for download at voicebot.ai. "This report is the first comprehensive analysis that considers how consumers are using voice assistants today for healthcare-related needs, explores features they'd like to see in the future, and highlights how provider and technology organizations have responded to the opportunity thus far," said Orbita President Nathan Treloar.


Hierarchical Sequence to Sequence Voice Conversion with Limited Data

arXiv.org Machine Learning

We present a voice conversion solution using recurrent sequence-to-sequence modeling for DNNs. Our solution takes advantage of recent advances in attention-based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting where <source, target> audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention-based architecture used in recent TTS works. Since there is a dearth of the large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single-speaker dataset as an autoencoder. This is then adapted to the smaller multispeaker datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a WaveNet vocoder.
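
As a rough illustration of the hierarchical-encoder idea, the PyTorch sketch below stacks bidirectional GRU layers that halve the frame rate by merging adjacent frames (in the spirit of pyramidal encoders used in ASR), so a downstream attention decoder attends over a shorter memory. The layer sizes and the pairwise merging factor are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidalEncoder(nn.Module):
    """Hierarchical encoder: each bi-GRU layer halves the frame rate by
    merging adjacent frames, so the attention decoder sees a shorter memory."""
    def __init__(self, n_mels=80, hidden=256, levels=3):
        super().__init__()
        dims = [n_mels] + [4 * hidden] * (levels - 1)
        self.layers = nn.ModuleList(
            nn.GRU(d, hidden, batch_first=True, bidirectional=True) for d in dims)

    def forward(self, mels):                          # (batch, T, n_mels)
        h = mels
        for gru in self.layers:
            h, _ = gru(h)                             # (batch, T, 2*hidden)
            if h.size(1) % 2:                         # drop a trailing odd frame
                h = h[:, :-1, :]
            # merge adjacent frames: (batch, T//2, 4*hidden)
            h = h.reshape(h.size(0), h.size(1) // 2, 2 * h.size(2))
        return h

# An attention-based decoder (e.g. Tacotron-style) would attend over this
# summarized memory and emit mel frames in the target speaker's voice.
memory = PyramidalEncoder()(torch.randn(2, 400, 80))  # -> (2, 50, 1024)
```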


A spelling correction model for end-to-end speech recognition

arXiv.org Artificial Intelligence

Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component of the end-to-end model is only trained on transcribed audio-text pairs, which leads to performance degradation, especially on rare words. While there has been a variety of work looking at incorporating an external LM trained on text-only data into the end-to-end framework, none of it takes into account the characteristic error distribution made by the model. In this paper, we propose a novel approach to utilizing text-only data by training a spelling correction (SC) model to explicitly correct those errors. On the LibriSpeech dataset, we demonstrate that the proposed model results in an 18.6% relative improvement in WER over the baseline model when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list using an external LM.
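
A hedged sketch of how such a spelling corrector could be wired in: text-only data is turned into (noisy hypothesis, reference) training pairs, and the trained corrector then rewrites the first-pass ASR output. The `synthesize`, `asr`, and `sc_model` callables are hypothetical stand-ins, not the paper's actual components.

```python
def build_sc_training_pairs(text_corpus, synthesize, asr):
    """Turn text-only data into (noisy ASR hypothesis, reference) pairs."""
    pairs = []
    for sentence in text_corpus:
        audio = synthesize(sentence)            # e.g. a TTS front end (assumption)
        hypothesis = asr(audio)                 # decode with the baseline ASR model
        pairs.append((hypothesis, sentence))    # corrector learns: hypothesis -> reference
    return pairs

def correct_top_hypothesis(audio, asr, sc_model):
    """At inference, rewrite the first-pass ASR output with the trained corrector."""
    hypothesis = asr(audio)
    return sc_model.decode(hypothesis)          # seq2seq correction of characteristic errors
```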


Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Neural Information Processing Systems

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.
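
The adversarial-alignment step can be pictured with a compact sketch in the style of unsupervised cross-lingual embedding mapping: a linear map takes speech embeddings into the text space while a discriminator tries to tell mapped speech vectors from real text vectors. The dimensions, optimizers, and sampling below are illustrative assumptions, not the authors' training recipe.

```python
import torch
import torch.nn as nn

dim = 300
W = nn.Linear(dim, dim, bias=False)                   # speech -> text mapping
disc = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                     nn.Linear(512, 1))                # logit: "is this a real text vector?"
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(disc.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

def train_step(speech_batch, text_batch):
    # 1) Discriminator: real text vectors -> 1, mapped speech vectors -> 0.
    mapped = W(speech_batch).detach()
    d_loss = bce(disc(text_batch), torch.ones(len(text_batch), 1)) + \
             bce(disc(mapped), torch.zeros(len(mapped), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Mapping: update W so mapped speech fools the discriminator.
    w_loss = bce(disc(W(speech_batch)), torch.ones(len(speech_batch), 1))
    opt_w.zero_grad(); w_loss.backward(); opt_w.step()
    return d_loss.item(), w_loss.item()

# Usage with random stand-in embeddings:
train_step(torch.randn(32, dim), torch.randn(32, dim))
```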


Introducing Wav2letter – Towards Data Science

#artificialintelligence

The current generation of speech recognition models relies mostly on recurrent neural networks (RNNs) for acoustic and language modeling, and on computationally expensive artifacts such as feature extraction pipelines for knowledge building. Recently, the Facebook AI Research (FAIR) team published a research paper proposing a new speech recognition technique based solely on convolutional neural networks (CNNs). The FAIR team went beyond research and open-sourced Wav2letter, a high-performance speech recognition toolkit based on the fully convolutional method. The great advantage of CNNs over the alternatives is that they naturally model the computation of standard features such as Mel-Frequency Cepstral Coefficients without requiring expensive feature extraction techniques. The architecture is based on a scattering-based model.
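
A toy, fully convolutional acoustic model in this spirit is sketched below: a stack of 1-D convolutions maps acoustic features directly to per-frame letter scores with no recurrence. The layer sizes are illustrative assumptions; the real toolkit is far larger and is trained with sequence criteria such as CTC or ASG.

```python
import torch
import torch.nn as nn

class ConvAcousticModel(nn.Module):
    """Stack of 1-D convolutions producing per-frame letter scores (no RNNs)."""
    def __init__(self, n_features=80, n_letters=29, channels=256, blocks=5):
        super().__init__()
        layers, in_ch = [], n_features
        for _ in range(blocks):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=7, padding=3),
                       nn.ReLU()]
            in_ch = channels
        layers.append(nn.Conv1d(channels, n_letters, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, feats):                 # feats: (batch, n_features, frames)
        return self.net(feats)                # (batch, n_letters, frames)

model = ConvAcousticModel()
logits = model(torch.randn(2, 80, 300))       # per-frame letter scores: (2, 29, 300)
```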


A Voice Controlled E-Commerce Web Application

arXiv.org Machine Learning

Automatic voice-controlled systems have changed the way humans interact with a computer. Voice or speech recognition systems allow a user to make a hands-free request to the computer, which in turn processes the request and serves the user with appropriate responses. After years of research and development in machine learning and artificial intelligence, voice-controlled technologies have become more efficient and are widely applied in many domains to enable and improve human-to-human and human-to-computer interactions. State-of-the-art e-commerce applications, with the help of web technologies, offer interactive and user-friendly interfaces. However, there are instances where people, especially those with visual disabilities, are not able to fully experience the serviceability of such applications. A voice-controlled system embedded in a web application can enhance the user experience and provide voice as a means to control the functionality of e-commerce websites. In this paper, we propose a taxonomy of speech recognition systems (SRS) and present a voice-controlled commodity-purchase e-commerce application using IBM Watson speech-to-text to demonstrate its usability. The prototype can be extended to other application scenarios, such as government service kiosks, and enables analytics of the converted text data for scenarios such as medical diagnosis at clinics. Voice recognition is often used interchangeably with speech recognition; however, voice recognition is primarily the task of determining the identity of a speaker rather than the content of the speaker's speech [1].
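
As a concrete illustration of the speech-to-text step, here is a hedged sketch using the ibm-watson Python SDK; the service URL, API key, audio file, and the toy keyword rules that map a transcript to an e-commerce action are placeholder assumptions, not the paper's implementation.

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and endpoint (assumptions).
authenticator = IAMAuthenticator('YOUR_API_KEY')
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url('https://api.us-south.speech-to-text.watson.cloud.ibm.com')

def transcribe(path):
    """Send a recorded voice request to Watson and return the top transcript."""
    with open(path, 'rb') as audio:
        result = stt.recognize(audio=audio, content_type='audio/wav').get_result()
    return result['results'][0]['alternatives'][0]['transcript']

def handle_request(path):
    """Map the transcript onto a simple e-commerce action (toy keyword rules)."""
    text = transcribe(path).lower()
    if 'add' in text and 'cart' in text:
        return ('add_to_cart', text)
    if 'search' in text:
        return ('search_catalog', text)
    return ('unknown', text)
```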


Double Coupled Canonical Polyadic Decomposition for Joint Blind Source Separation

arXiv.org Machine Learning

Joint blind source separation (J-BSS) is an emerging data-driven technique for multi-set data fusion. In this paper, J-BSS is addressed from a tensorial perspective. We show how, by using second-order multi-set statistics in J-BSS, a specific double coupled canonical polyadic decomposition (DC-CPD) problem can be formulated. We propose an algebraic DC-CPD algorithm based on a coupled rank-1 detection mapping. This algorithm converts a possibly underdetermined DC-CPD into a set of overdetermined CPDs. The latter can be solved algebraically via a generalized eigenvalue decomposition based scheme. Therefore, this algorithm is deterministic and returns the exact solution in the noiseless case. In the noisy case, it can be used to effectively initialize optimization-based DC-CPD algorithms. In addition, we obtain the deterministic and generic uniqueness conditions for DC-CPD, which are shown to be more relaxed than their CPD counterparts. Experimental results are given to illustrate the superiority of DC-CPD over standard CPD-based BSS methods and several existing J-BSS methods, with regard to uniqueness and accuracy.
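
For intuition about the generalized eigenvalue decomposition at the heart of the algebraic scheme, the sketch below shows a much simpler, single-set second-order BSS method (AMUSE-style joint diagonalization of the zero-lag and lagged covariances); it is not the DC-CPD algorithm itself, and the toy sources are invented for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def gevd_bss(X, lag=1):
    """X: (n_sensors, n_samples) mixtures; returns source estimates up to scale/order."""
    X = X - X.mean(axis=1, keepdims=True)
    n = X.shape[1] - lag
    R0 = X[:, :n] @ X[:, :n].T / n                  # zero-lag covariance
    R1 = X[:, lag:lag + n] @ X[:, :n].T / n         # lagged covariance
    R1 = (R1 + R1.T) / 2                            # symmetrize
    _, V = eigh(R1, R0)                             # generalized eigenvectors (columns of V)
    return V.T @ X                                  # eigenvectors act as unmixing filters

# Toy usage: two sources with different temporal structure, randomly mixed.
t = np.arange(10000)
S = np.vstack([np.sin(0.01 * t), np.sign(np.sin(0.031 * t))])
X = np.random.randn(2, 2) @ S
S_hat = gevd_bss(X)                                 # recovered up to scale and permutation
```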