Facebook's voice synthesis AI generates speech in 500 milliseconds
Facebook today unveiled a highly efficient AI text-to-speech (TTS) system that can be served in real time on regular processors (CPUs). Combined with a new data collection approach that uses a language model for curation, the system, which produces a second of audio in 500 milliseconds, enabled Facebook to create a British-accented voice in six months, as opposed to over a year for previous voices, the company says. Most modern AI TTS systems require graphics cards, field-programmable gate arrays (FPGAs), or custom-designed AI chips like Google's tensor processing units (TPUs) for inference, training, or both. For instance, a recently detailed Google AI system was trained across 32 TPUs in parallel. Synthesizing a single second of humanlike audio can require outputting as many as 24,000 samples, and sometimes even more.
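To put the reported figures in perspective, the short Python sketch below works out what "a second of audio in 500 milliseconds" implies at 24,000 samples per second. The function names and the "real-time factor" framing are illustrative assumptions, not something taken from Facebook's announcement; only the 500-millisecond and 24,000-sample figures come from the article.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def required_throughput(sample_rate_hz: int, rtf: float) -> float:
    """Audio samples that must be produced per second of compute at a given real-time factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return sample_rate_hz / rtf


# Figures quoted in the article: 1 second of audio in 0.5 seconds, 24,000 samples per second.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=1.0)
throughput = required_throughput(sample_rate_hz=24_000, rtf=rtf)

print(f"real-time factor: {rtf}")
print(f"samples per second of compute: {throughput:.0f}")
```

Under these assumptions the system runs at half of real time, which means it has to emit roughly 48,000 samples for every second it spends computing.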
May-16-2020, 19:52:50 GMT