Microsoft's new VALL-E AI can capture your voice in 3 seconds
Microsoft researchers have presented an impressive new text-to-speech AI model, called VALL-E, which can listen to a voice for just a few seconds, then mimic that voice – including its emotional tone and acoustics – to say whatever you like. It's the latest of many AI algorithms that can take a recording of a person's voice and make it say words and sentences that person never spoke – and it's remarkable for just how small a scrap of audio it needs in order to extrapolate an entire human voice. Where 2017's Lyrebird algorithm from the University of Montreal, for example, needed a full minute of speech to analyze, VALL-E needs just a three-second audio snippet. The AI has been trained on some 60,000 hours of English speech – mainly, it seems, from audiobook narrators – and the researchers have presented a swag of samples in which VALL-E attempts to puppeteer a range of human voices. Some do a pretty extraordinary job of capturing the essence of the voice and building new sentences that sound natural – you'd struggle to tell which was the real voice and which was the synthesis. In others, the only giveaway is that the AI puts the emphasis in strange places in the sentence.
Jan-11-2023, 06:37:48 GMT