Combining CNN and RNN for spoken language identification · YerevaNN
Last year Hrayr used convolutional networks to identify spoken language from short audio recordings for a TopCoder contest and got 95% accuracy. After the end of the contest we decided to try recurrent neural networks and their combinations with CNNs on the same task. The best combination allowed to reach 99.24% and an ensemble of 33 models reached 99.67%. As before, the inputs of the networks are spectrograms of speech recordings. It seems spectrograms are the standard way to represent audio for deep learning systems (see "Listen, Attend and Spell" and "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin").
Jun-27-2016, 19:39:51 GMT