We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground-truth source-language transcription during training. We apply a slightly modified attention-based sequence-to-sequence architecture that has previously been used for speech recognition, and show that it can be repurposed for this more complex task, illustrating the power of attention-based models. A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that exploiting the training data in both languages, by jointly training sequence-to-sequence speech translation and recognition models with a shared encoder network, improves performance by a further 1.4 BLEU points.
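The attention mechanism at the heart of such sequence-to-sequence models can be sketched in a few lines. The following is a minimal, illustrative dot-product attention step in plain Python, not the paper's actual implementation: given a set of encoder states and the current decoder state, it scores each encoder state, normalizes the scores with a softmax, and returns the weighted-sum context vector. All function names and the toy dimensions are our own.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_states):
    """One attention step: score each encoder state against the
    current decoder state via a dot product, normalize the scores,
    and return the attention-weighted sum (the context vector)."""
    scores = [sum(d * h_i for d, h_i in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy example: three 2-dimensional encoder states, one decoder state.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [1.0, 0.0]
ctx, w = dot_product_attention(dec, enc)
```

In a full model, each decoding step would feed the context vector, together with the previous output token, back into the decoder RNN; the multi-task variant described above would share the encoder states between a translation decoder and a recognition decoder.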
As part of an ongoing effort within Microsoft to improve the accuracy of artificial intelligence (AI) systems, Microsoft Translator is publicly releasing a set of data comprising multiple conversations between bilingual speakers of French, German, and English. The corpus, produced by Microsoft using bilingual speakers, aims to establish a standard against which conversational speech translation systems can be measured, and can serve as a standardized test set for systems such as the Microsoft Translator live feature and Skype Translator. Christian Federmann, a senior program manager working with the Microsoft Translator team, noted that few standardized data sets exist for testing bilingual conversational speech translation systems. "You need high-quality data in order to have high-quality testing," Federmann said.
While computer scientists have yet to build a working "universal translator" such as the one first described in the 1945 science-fiction novella "First Contact" and later employed by the crew of the Starship Enterprise on "Star Trek," the hurdles to creating one are steadily being cleared. The practical need for instant, simultaneous speech-to-speech translation is growing across a number of applications. Consider the hypergrowth of social networking and Skype chats, which demand reliable, immediate, bidirectional translation. Similarly, when natural disasters strike, aid workers often struggle to communicate with victims who speak other languages, a problem that can quickly become overwhelming.