This tutorial will show you how to build a basic speech recognition network that recognizes ten different words. It's important to know that real speech and audio recognition systems are much more complex, but like MNIST for images, it should give you a basic understanding of the techniques involved. Once you've completed this tutorial, you'll have a model that tries to classify a one second audio clip as "down", "go", "left", "no", "right", "stop", "up" and "yes". Let's install TensorFlow to get started. The script will start off by downloading a portion of the Speech Commands dataset.
While much of the literature and buzz on deep learning concerns computer vision and natural language processing(NLP), audio analysis -- a field that includes automatic speech recognition(ASR), digital signal processing, and music classification, tagging, and generation -- is a growing subdomain of deep learning applications. Some of the most popular and widespread machine learning systems, virtual assistants Alexa, Siri, and Google Home, are largely products built atop models that can extract information from audio signals. Audio data analysis is about analyzing and understanding audio signals captured by digital devices, with numerous applications in the enterprise, healthcare, productivity, and smart cities. Applications include customer satisfaction analysis from customer support calls, media content analysis and retrieval, medical diagnostic aids and patient monitoring, assistive technologies for people with hearing impairments, and audio analysis for public safety. In the first part of this article series, we will talk about all you need to know before getting started with the audio data analysis and extract necessary features from a sound/audio file. We will also build an Artificial Neural Network(ANN) for the music genre classification.
The problem of identifying voice commands has always been a challenge due to the presence of noise and variability in speed, pitch, etc. We will compare the efficacies of several neural network architectures for the speech recognition problem. In particular, we will build a model to determine whether a one second audio clip contains a particular word (out of a set of 10), an unknown word, or silence. The models to be implemented are a CNN recommended by the Tensorflow Speech Recognition tutorial, a low-latency CNN, and an adversarially trained CNN. The result is a demonstration of how to convert a problem in audio recognition to the better-studied domain of image classification, where the powerful techniques of convolutional neural networks are fully developed. Additionally, we demonstrate the applicability of the technique of Virtual Adversarial Training (VAT) to this problem domain, functioning as a powerful regularizer with promising potential future applications.
We have seen a lot of recent advances in deep learning related to vision and language fields, it is intuitive to understand why CNN performs very well on images, with pixel's local correlation, and how sequential models like RNNs or transformers also perform very well on language, with its sequential nature, but what about audio? In this article you will learn how to approach a simple audio classification problem, you will learn some of the common and efficient methods used, and the Tensorflow code to do it. Disclaimer: The code presented here is based on my work developed for the "Rainforest Connection Species Audio Detection" Kaggle competition, but for demonstration purposes, I will use the "Speech Commands" dataset. We usually have audio files in the ".wav" format, they are commonly referred to as waveforms, a waveform is a time series with the signal amplitude at each specific time, if we visualize one of those waveform samples we will get something like this: Intuitively one might consider modeling this data like a regular time series (e.g. stock price forecasting) using some kind of RNN model, in fact, this could be done, but since we are using audio signals, a more appropriate choice is to transform the waveform samples into spectrograms. A spectrogram is an image representation of the waveform signal, it shows its frequency intensity range over time, it can be very useful when we want to evaluate the signal's frequency distribution over time.
While deep learning models are able to help tackle many different types of problems, image classification is the most prevalent example for courses and frameworks, often acting as the "hello, world" introduction. FastAI is a high-level library built on top of PyTorch that makes it extremely easy to get started classifying images, with an example showing how train an accurate model in only four lines of code. After competing in the Freesound General-Purpose Audio Tagging Kaggle competition over the summer, I decided to repurpose some of my code to take advantage of fastai's benefits for audio classification as well. This article will give a quick introduction to working with audio files in Python, give some background around creating spectrogram images, and then show how to leverage pretrained image models without actually having to generate images beforehand. All the code used to generate the content of this post will be available in this repository, complete with example notebooks.