Goto

Collaborating Authors

Speech Recognition: Overviews


Deep Neural Networks for Automatic Speech Processing: A Survey from Large Corpora to Limited Data

arXiv.org Machine Learning

Most state-of-the-art speech systems are using Deep Neural Networks (DNNs). Those systems require a large amount of data to be learned. Hence, learning state-of-the-art frameworks on under-resourced speech languages/problems is a difficult task. Problems could be the limited amount of data for impaired speech. Furthermore, acquiring more data and/or expertise is time-consuming and expensive. In this paper we position ourselves for the following speech processing tasks: Automatic Speech Recognition, speaker identification and emotion recognition. To assess the problem of limited data, we firstly investigate state-of-the-art Automatic Speech Recognition systems as it represents the hardest tasks (due to the large variability in each language). Next, we provide an overview of techniques and tasks requiring fewer data. In the last section we investigate few-shot techniques as we interpret under-resourced speech as a few-shot problem. In that sense we propose an overview of few-shot techniques and perspectives of using such techniques for the focused speech problems in this survey. It occurs that the reviewed techniques are not well adapted for large datasets. Nevertheless, some promising results from the literature encourage the usage of such techniques for speech processing.


Wavesplit: End-to-End Speech Separation by Speaker Clustering

arXiv.org Machine Learning

We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representations provide a more robust separation of long, challenging sequences, compared to previous approaches. We show that Wavesplit outperforms the previous state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as in noisy (WHAM!) and reverberated (WHAMR!) conditions. As an additional contribution, we further improve our model by introducing online data augmentation for separation.


On the human evaluation of audio adversarial examples

arXiv.org Machine Learning

Human-machine interaction is increasingly dependent on speech communication. Machine Learning models are usually applied to interpret human speech commands. However, these models can be fooled by adversarial examples, which are inputs intentionally perturbed to produce a wrong prediction without being noticed. While much research has been focused on developing new techniques to generate adversarial perturbations, less attention has been given to aspects that determine whether and how the perturbations are noticed by humans. This question is relevant since high fooling rates of proposed adversarial perturbation strategies are only valuable if the perturbations are not detectable. In this paper we investigate to which extent the distortion metrics proposed in the literature for audio adversarial examples, and which are commonly applied to evaluate the effectiveness of methods for generating these attacks, are a reliable measure of the human perception of the perturbations. Using an analytical framework, and an experiment in which 18 subjects evaluate audio adversarial examples, we demonstrate that the metrics employed by convention are not a reliable measure of the perceptual similarity of adversarial examples in the audio domain.


Global AI/Machine Learning Market 2019-2025 forecast by top players : GOOGLE, IBM, BAIDU, SOUNDHOUND, ZEBRA MEDICAL VISION, PRISMA – News Cast Report

#artificialintelligence

About Us: With unfailing market gauging skills, Orbis Market Reports has been excelling in curating tailored business intelligence data across industry verticals. Constantly thriving to expand our skill development, our strength lies in dedicated intellectuals with dynamic problem-solving intent, ever willing to mold boundaries to scale heights in market interpretation.


Universal adversarial examples in speech command classification

arXiv.org Machine Learning

Adversarial examples are inputs intentionally perturbed with the aim of forcing a machine learning model to produce a wrong prediction, while the changes are not easily detectable by a human. Although this topic has been intensively studied in the image domain, classification tasks in the audio domain have received less attention. In this paper we address the existence of universal perturbations for speech command classification. We provide evidence that universal attacks can be generated for speech command classification tasks, which are able to generalize across different models to a significant extent. Additionally, a novel analytical framework is proposed for the evaluation of universal perturbations under different levels of universality, demonstrating that the feasibility of generating effective perturbations decreases as the universality level increases. Finally, we propose a more detailed and rigorous framework to measure the amount of distortion introduced by the perturbations, demonstrating that the methods employed by convention are not realistic in audio-based problems.


Artificial Intelligence and Machine Learning Drive The Future

#artificialintelligence

One of the disruptive technologies that has gained increasingly more attention after the turn of the century is Machine Learning. Machine Leaning – closely related and usually considered as a subfield of Artificial Intelligence (AI) – is the process of automatic detection of usable patterns within data. The detection of these patterns is performed with the help of machine learning algorithms which are specifically tailored to deal with complex and large data sets. Such powerful algorithms have the potential of drastically revolutionizing the way of doing business and how businesses operate. With this article I will provide an overview of opportunities that machine learning algorithms and Artificial Intelligence (AI) pose to the business environment.


More Than Half of Consumers Want to Use Voice Assistants for Healthcare - The Ritz Herald

#artificialintelligence

Orbita, Inc., provider of healthcare's most powerful conversational AI platform, today announced the release of the Voice Assistant Consumer Adoption Report for Healthcare 2019. To develop this report, Orbita sponsored independent research by Voicebot.ai, Based on a survey of 1,004 U.S. adults, the report includes these key highlights: The 40-page report includes 20 charts, ten case studies highlighting today's real-world voice-powered healthcare solutions, and 35 pages of analysis. It is available at no cost for download at voicebot.ai. "This report is the first comprehensive analysis that considers how consumers are using voice assistants today for healthcare-related needs, explores features they'd like to see in the future, and highlights how provider and technology organizations have responded to the opportunity thus far," said Orbita President Nathan Treloar.


A 20-Year Community Roadmap for Artificial Intelligence Research in the US

arXiv.org Artificial Intelligence

Decades of research in artificial intelligence (AI) have produced formidable technologies that are providing immense benefit to industry, government, and society. AI systems can now translate across multiple languages, identify objects in images and video, streamline manufacturing processes, and control cars. The deployment of AI systems has not only created a trillion-dollar industry that is projected to quadruple in three years, but has also exposed the need to make AI systems fair, explainable, trustworthy, and secure. Future AI systems will rightfully be expected to reason effectively about the world in which they (and people) operate, handling complex tasks and responsibilities effectively and ethically, engaging in meaningful communication, and improving their awareness through experience. Achieving the full potential of AI technologies poses research challenges that require a radical transformation of the AI research enterprise, facilitated by significant and sustained investment. These are the major recommendations of a recent community effort coordinated by the Computing Community Consortium and the Association for the Advancement of Artificial Intelligence to formulate a Roadmap for AI research and development over the next two decades.


Machine Learning at the Network Edge: A Survey

arXiv.org Machine Learning

Devices comprising the Internet of Things, such as sensors and small cameras, usually have small memories and limited computational power. The proliferation of such resource-constrained devices in recent years has led to the generation of large quantities of data. These data-producing devices are appealing targets for machine learning applications but struggle to run machine learning algorithms due to their limited computing capability. They typically offload input data to external computing systems (such as cloud servers) for further processing. The results of the machine learning computations are communicated back to the resource-scarce devices, but this worsens latency, leads to increased communication costs, and adds to privacy concerns. Therefore, efforts have been made to place additional computing devices at the edge of the network, i.e close to the IoT devices where the data is generated. Deploying machine learning systems on such edge devices alleviates the above issues by allowing computations to be performed close to the data sources. This survey describes major research efforts where machine learning has been deployed at the edge of computer networks.


Hierarchical Sequence to Sequence Voice Conversion with Limited Data

arXiv.org Machine Learning

We present a voice conversion solution using recurrent sequence to sequence modeling for DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting when {\it $<$source,target$>$} audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use $F_0$, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a wavenet vocoder.