I recently started working on a speech classification problem. Because I know very little about speech and audio processing, I had to go back over the very basics. In this post, I want to go over some of the things I learned. To do that, I will work with the "speech MNIST" dataset, i.e., a set of recorded spoken digits. You can find the dataset here. As mentioned in the title, this is a classification problem: we get a recording and need to predict the digit spoken in it.
Identifying voice commands has always been a challenge because of noise and variability in speed, pitch, and other properties of speech. We will compare the effectiveness of several neural network architectures on the speech recognition problem. In particular, we will build a model to determine whether a one-second audio clip contains a particular word (out of a set of 10), an unknown word, or silence. The models to be implemented are a CNN recommended by the TensorFlow Speech Recognition tutorial, a low-latency CNN, and an adversarially trained CNN. The result is a demonstration of how to convert a problem in audio recognition to the better-studied domain of image classification, where the powerful techniques of convolutional neural networks are well developed. Additionally, we demonstrate that Virtual Adversarial Training (VAT) applies to this problem domain, where it acts as a powerful regularizer with promising future applications.
Stochastic Signal Analysis is a field of science concerned with the processing, modification and analysis of (stochastic) signals. Anyone with a background in Physics or Engineering knows to some degree about signal analysis techniques: what these techniques are and how they can be used to analyze, model and classify signals. Data Scientists coming from different fields, like Computer Science or Statistics, might not be aware of the analytical power these techniques bring with them. In this blog post, we will have a look at how we can use Stochastic Signal Analysis techniques, in combination with traditional Machine Learning classifiers, for accurate classification and modelling of time-series and signals. At the end of the blog post you should be able to understand the various signal-processing techniques which can be used to retrieve features from signals. You should also be able to classify ECG signals (and even identify a person by their ECG signal), predict seizures from EEG signals, classify and identify targets in radar signals, identify patients with neuropathy or myopathy from EMG signals by using the FFT, and so on. In this blog post we'll discuss the following topics:
You might often have come across the words time-series and signals describing datasets, and it might not be clear what the exact difference between them is.
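To make the idea of retrieving features from a signal with the FFT a little more concrete, here is a minimal sketch in NumPy. It is not the code from this post: the helper name `dominant_frequencies`, the synthetic two-tone signal and the 1000 Hz sample rate are all assumptions chosen for illustration. The sketch picks the frequencies with the largest FFT magnitude, which is one simple kind of feature that could be passed to a classifier.

```python
import numpy as np


def dominant_frequencies(signal, sample_rate, n_peaks=2):
    """Return the n_peaks frequencies (in Hz) with the largest FFT magnitude.

    Hypothetical helper for illustration only.
    """
    # Magnitude spectrum of a real-valued signal (non-negative frequencies only).
    spectrum = np.abs(np.fft.rfft(signal))
    # Frequency in Hz that corresponds to each FFT bin.
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Ignore the DC component so a constant offset is never picked as a peak.
    spectrum[0] = 0.0
    # Indices of the largest magnitudes, strongest first.
    top = np.argsort(spectrum)[-n_peaks:][::-1]
    return freqs[top]


# Synthetic test signal (an assumption): one second sampled at 1000 Hz,
# made of a strong 50 Hz tone and a weaker 120 Hz tone.
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
signal = 1.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

peaks = dominant_frequencies(signal, sample_rate)
print(peaks)
```

Real ECG, EEG or EMG recordings are noisy, so in practice the peaks would usually be taken from a smoothed or averaged spectrum rather than from a single raw FFT.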
There are countless ways to perform audio processing. The usual flow for running experiments with Artificial Neural Networks in TensorFlow with audio inputs is to first preprocess the audio and then feed it to the neural network. But what happens when one wants to perform audio processing somewhere in the middle of the computation graph? TensorFlow comes with an implementation of the Fast Fourier Transform, but that alone is not enough. In this post we will explain how we implemented the Short Time Fourier Transform and provide the code so that it can be used anywhere in the computation graph.
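As a rough illustration of why a plain FFT is not enough, the sketch below builds a Short Time Fourier Transform out of an FFT in NumPy: the signal is cut into overlapping frames, each frame is multiplied by a window, and the FFT is applied to every frame. This is not the TensorFlow implementation described in this post. The function name `stft`, the Hann window and the default `frame_length` and `frame_step` values are assumptions made for the example.

```python
import numpy as np


def stft(signal, frame_length=256, frame_step=128):
    """Short Time Fourier Transform of a 1-D real signal.

    Illustrative sketch: the frame sizes and the Hann window are assumptions.
    Returns a complex array of shape (n_frames, frame_length // 2 + 1).
    """
    # Number of complete frames that fit in the signal.
    n_frames = 1 + (len(signal) - frame_length) // frame_step
    # Index matrix of shape (n_frames, frame_length): row i selects frame i.
    idx = (np.arange(frame_length)[None, :]
           + frame_step * np.arange(n_frames)[:, None])
    # Slice out the overlapping frames and taper each one with a Hann window.
    frames = signal[idx] * np.hanning(frame_length)
    # One FFT per frame gives the spectrum as a function of time.
    return np.fft.rfft(frames, axis=1)


# Synthetic input (an assumption): one second of a 1000 Hz tone at 8000 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft(tone)
magnitudes = np.abs(spec)
print(spec.shape)
```

The framing and windowing steps are the parts a bare FFT does not provide. Inside a computation graph, the same three steps would have to be expressed with the framework's own tensor operations so that gradients can flow through them.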