The new system, dubbed the Translatotron, has three components, all of which look at the speaker's audio spectrogram--a visual snapshot of the frequencies used when the sound is playing, often called a voiceprint. The first component uses a neural network trained to map the audio spectrogram in the input language to the audio spectrogram in the output language. The second converts the spectrogram into an audio wave that can be played.
On Wednesday, Google unveiled Translatotron, an in-development speech-to-speech translation system. It's not the first system to translate speech from one language to another, but Google designed Translatotron to do something other systems can't: retain the original speaker's voice in the translated audio. In other words, the tech could make it sound like you're speaking a language you don't know -- a remarkable step forward on the path to breaking down the global language barrier. According to Google's AI blog, most speech-to-speech translation systems follow a three-step process. First they transcribe the speech.
Diverse promising datasets have been designed to hold back the development of fake audio detection, such as ASVspoof databases. However, previous datasets ignore an attacking situation, in which the hacker hides some small fake clips in real speech audio. This poses a serious threat since that it is difficult to distinguish the small fake clip from the whole speech utterance. Therefore, this paper develops such a dataset for half-truth audio detection (HAD). Partially fake audio in the HAD dataset involves only changing a few words in an utterance.The audio of the words is generated with the very latest state-of-the-art speech synthesis technology. We can not only detect fake uttrances but also localize manipulated regions in a speech using this dataset. Some benchmark results are presented on this dataset. The results show that partially fake audio presents much more challenging than fully fake audio for fake audio detection.
We construct audio adversarial examples on automatic Speech-To-Text systems . Given any audio waveform, we produce an another by overlaying an audio vocal mask generated from the original audio. We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx. In addition, we engaged human annotators to transcribe the adversarial audio. Our experiments show that these adversarial examples fool State-Of-The-Art Speech-To-Text systems, yet humans are able to consistently pick out the speech. The feasibility of this attack introduces a new domain to study machine and human perception of speech.
Over the last few years, Voice Assistants have become ubiquitous with the popularity of Google Home, Amazon Echo, Siri, Cortana, and others. These are the most well-known examples of Automatic Speech Recognition (ASR). This class of applications starts with a clip of spoken audio in some language and extracts the words that were spoken, as text. For this reason, they are also known as Speech-to-Text algorithms. Of course, applications like Siri and the others mentioned above, go further.