When attempting to learn a model with a wide-spectrum waveform (CD quality) directly from the pulse-code modulation (PCM) digitised representation of the audio, the network would be required to learn a sequence of 44,100 samples of 16-bit data for a single second of audio. A major appeal of Codec 2 is that the harmonic sinusoidal coding relates all encoded harmonic components back to the primary frequency. In the 1300 bps version of the codec a single 16-parameter frame represents 40 ms of audio, such that just 25 frames are required per second (a rate of 'symbols' to be learned of approximately 1/1700 of 44.1 kHz PCM), with more meaningful vocal representation. Simply described, the network takes Codec 2 encoded audio as its input, and utilises three long short-term memory (LSTM) layers (ref: 10) with a final fully connected rectified linear unit (ReLU) layer.
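The frame-rate comparison in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch of the calculation. The constant names are illustrative, and the PCM bit rate assumes a single (mono) channel, which the text does not state.

```python
# Figures taken from the text: CD-quality PCM and the 1300 bps Codec 2 mode.
PCM_SAMPLE_RATE_HZ = 44_100   # samples per second
PCM_BITS_PER_SAMPLE = 16
CODEC2_BITRATE_BPS = 1_300
CODEC2_FRAME_MS = 40          # audio covered by one frame

# Number of Codec 2 frames needed for one second of audio.
frames_per_second = 1000 // CODEC2_FRAME_MS

# Bits available to encode one frame at 1300 bps.
bits_per_frame = CODEC2_BITRATE_BPS * CODEC2_FRAME_MS // 1000

# How many PCM samples correspond to one Codec 2 frame ("symbol").
symbol_rate_ratio = PCM_SAMPLE_RATE_HZ / frames_per_second

# Raw PCM bit rate, assuming one channel (an assumption, not from the text).
pcm_bits_per_second = PCM_SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE

print(frames_per_second, bits_per_frame, symbol_rate_ratio, pcm_bits_per_second)
```

The ratio works out to 1764 PCM samples per frame, which is the "approximately 1/1700" figure quoted above.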
Artificial intelligence researchers at DeepMind have created some of the most realistic-sounding human-like speech using neural networks. Dubbed WaveNet, the AI promises significant improvements to computer-generated speech, and could eventually be used in digital personal assistants such as Siri, Cortana and Amazon's Alexa. The technology generates voices by sampling real human speech from both English and Mandarin speakers. In tests, the WaveNet-generated speech was found to be more realistic than that of other text-to-speech programs, but it still fell short of being truly convincing. In 500 blind tests, respondents were asked to judge sample sentences on a scale of one to five (five being most realistic).
Speech processing is a very popular area of machine learning. There is significant demand for transforming human speech into text and text into speech. This is especially important for the development of self-service systems in places such as shops, transport and hotels. Machines are replacing more and more of the human labor force, and these machines should be able to communicate with us in our language. That is why speech recognition is a promising and significant area of artificial intelligence and machine learning. Today, many large companies provide APIs for performing different machine learning tasks, and speech recognition is no exception. You don't have to be an expert in natural language processing to use these APIs.
Google's DeepMind announced the WaveNet project, a fully convolutional, probabilistic and autoregressive deep neural network. It synthesizes new speech and music from audio and sounds more natural than the best existing Text-To-Speech (TTS) systems, according to DeepMind. Speech synthesis is largely based on concatenative TTS, where a database of short speech fragments is recorded from a single speaker and recombined to form speech. This approach isn't flexible and can't easily be adjusted to new voice inputs, so drastically altering existing voice properties often means completely rebuilding the dataset. DeepMind notes that while previous models typically hinge on a large audio dataset from a single input source, or single person, WaveNet retains its models as sets of parameters that can be modified based on new input to an existing model.
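To make "probabilistic and autoregressive" concrete, the following toy sketch draws each audio sample from a probability distribution conditioned on the samples generated so far. It is not WaveNet: `toy_conditional` is a hypothetical hand-written stand-in for a trained network, and the four amplitude levels and the seed are arbitrary choices made for illustration.

```python
import random


def toy_conditional(history, levels=4):
    """Hypothetical stand-in for a trained model: returns one probability
    per quantised amplitude level, given the samples generated so far."""
    last = history[-1] if history else 0
    # Favour levels close to the previous sample (purely illustrative).
    weights = [1.0 / (1 + abs(level - last)) for level in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, levels=4, seed=0):
    """Generate samples one at a time, each conditioned on earlier output."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_conditional(samples, levels)
        # Sample the next value from the predicted distribution.
        samples.append(rng.choices(range(levels), weights=probs, k=1)[0])
    return samples


print(generate(16))
```

In a real system the conditional distribution would come from the network's output rather than a fixed rule, but the generation loop has the same shape: predict a distribution, sample from it, append the sample, repeat.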
In this paper we propose an end-to-end LSTM-based model that performs single-channel speech enhancement and phone recognition in a cocktail party scenario where visual information of the target speaker is available. In the speech enhancement phase the proposed system uses a "visual attention" signal of the speaker of interest to extract her speech from the input mixed-speech signal, while in the ASR phase it recognizes her phone sequence through a phone recognizer trained with a CTC loss. It is well known that learning multiple related tasks from data simultaneously can improve performance compared with learning these tasks independently; we therefore decided to train the model by optimizing both tasks at the same time. This also allowed us to explore whether (and how) this joint optimization leads to better results. We analyzed different training strategies that reveal some interesting and unexpected behaviors. In particular, the experiments demonstrated that during optimization of the ASR phase the speech enhancement capability of the model significantly decreases, and vice versa. We evaluated our approach on mixed-speech versions of GRID and TCD-TIMIT. The obtained results show a remarkable drop in the Phone Error Rate (PER) compared to the audio-visual baseline models trained only to perform phone recognition.