Acoustic Processing

t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification

The ASVspoof challenge series was born to spearhead research in anti-spoofing for automatic speaker verification (ASV). The two challenge editions in 2015 and 2017 involved the assessment of spoofing countermeasures (CMs) in isolation from ASV using an equal error rate (EER) metric. While a strategic approach to assessment at the time, it has certain shortcomings. First, the CM EER is not necessarily a reliable predictor of performance when ASV and CMs are combined. Second, the EER operating point is ill-suited to user authentication applications, e.g. telephone banking, characterised by a high target user prior but a low spoofing attack prior. We aim to migrate from CM- to ASV-centric assessment with the aid of a new tandem detection cost function (t-DCF) metric. It extends the conventional DCF used in ASV research to scenarios involving spoofing attacks. The t-DCF metric has six parameters: (i) false alarm and miss costs for both systems, and (ii) prior probabilities of target and spoof trials (with an implied third, nontarget prior). The study is intended to serve as a self-contained, tutorial-like presentation. Using the t-DCF, we analyse a selection of top-performing CM submissions to the 2015 and 2017 editions of ASVspoof, with a focus on the spoofing attack prior. Whereas there is little to choose between countermeasure systems for lower priors, system rankings derived with the EER and the t-DCF diverge for higher priors, with some clear ranking changes. These findings support the adoption of the DCF-based metric into the roadmap for future ASVspoof challenges, and possibly for other biometric anti-spoofing evaluations.
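The basic shape of a detection cost function can be sketched as a weighted sum of error rates: each error type is multiplied by its cost and by the prior of the trial class on which it occurs. The sketch below is a simplified illustration under assumed parameter names and values (pi_tar, pi_non, pi_spoof, c_miss, c_fa, and the error-rate arguments are all illustrative); it is not the exact tandem formulation from the paper, which additionally models how ASV and CM decisions interact when the two systems are cascaded.

```python
def dcf(p_miss, p_fa, pi_tar=0.99, c_miss=1.0, c_fa=10.0):
    """Conventional ASV detection cost: miss and false-alarm rates,
    weighted by their costs and the target/nontarget priors."""
    return c_miss * pi_tar * p_miss + c_fa * (1.0 - pi_tar) * p_fa


def tandem_dcf(p_miss_asv, p_fa_asv, p_miss_cm, p_fa_cm,
               pi_tar=0.94, pi_non=0.05, pi_spoof=0.01,
               c_miss_asv=1.0, c_fa_asv=10.0,
               c_miss_cm=1.0, c_fa_cm=10.0):
    """Simplified tandem cost sketch: ASV errors on target/nontarget
    trials plus CM errors on target/spoof trials, each weighted by its
    cost and class prior. Illustrative only, not the paper's t-DCF."""
    return (c_miss_asv * pi_tar * p_miss_asv
            + c_fa_asv * pi_non * p_fa_asv
            + c_miss_cm * pi_tar * p_miss_cm
            + c_fa_cm * pi_spoof * p_fa_cm)
```

Sweeping pi_spoof in such a cost while holding the error rates fixed shows the effect discussed in the abstract: at low spoof priors the CM terms barely move the total cost, so systems are hard to separate, while higher priors amplify CM error differences and can reorder the ranking.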

Speaker-independent raw waveform model for glottal excitation

Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to more effectively train a speaker-independent waveform generator with limited resources. We present a multi-speaker 'GlotNet' vocoder, which uses a WaveNet to generate glottal excitation waveforms; these are then passed through the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model compares favourably to a direct WaveNet vocoder trained with the same model architecture and data.
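The source-filter decomposition at the heart of this approach can be sketched in a few lines: an excitation signal is fed through an all-pole vocal tract filter to produce the output waveform. The sketch below is a minimal illustration, not the paper's system; white noise stands in for the WaveNet-generated glottal excitation, and the filter coefficients are an assumed, stable two-pole resonance rather than coefficients estimated from speech.

```python
import random


def all_pole_filter(excitation, a):
    """Direct-form all-pole synthesis: y[n] = x[n] - sum_k a[k] * y[n-k].
    `a` holds the denominator coefficients a[1..p] (a[0] = 1 implied)."""
    y = [0.0] * len(excitation)
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y


random.seed(0)
# Stand-in glottal source: 1 second of white noise at 16 kHz.
excitation = [random.gauss(0.0, 1.0) for _ in range(16000)]
# Illustrative stable two-pole resonance (poles at 0.9 +/- 0.4j).
a = [-1.8, 0.97]
speech = all_pole_filter(excitation, a)
```

In the GlotNet setting, the excitation above would instead come from the trained WaveNet, and the filter from frame-wise vocal tract estimates; the point of the factorisation is that the excitation is a much simpler, more speaker-independent signal for the neural model to learn than the full speech waveform.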

AdMobilize to Introduce Voice Recognition Capabilities at DSE 2018


AdMobilize will introduce its MATRIX Voice dev board to the digital signage industry at DSE 2018 in booth 2369 at the Las Vegas Convention Center. "Put simply, the company that introduced AI-powered audience analytics to the digital signage industry is now bringing voice recognition functionality to both manufacturers and systems integrators alike through its MATRIX product line," said AdMobilize co-founder and CEO Rodolfo Saccoman. "We believe that voice engagement technologies will make digital signage a more compelling and sticky communications solution for an even broader range of vertical markets. The combination of audience analytics and voice recognition functionality truly represents the next chapter in this constantly evolving industry and AdMobilize is at the forefront of making this chapter a reality." Available for $55.00, MATRIX Voice will integrate with any voice recognition service (Amazon Alexa, Google Assistant, or any other third-party service) at any time.

Voice recognition software advancing rapidly. Will talking replace typing?


Since Apple developed Siri, there have been great strides in the science of voice recognition. Will we soon be throwing away our mice and keyboards and simply talking to our computers? Or will the problems I have with Alexa continue to haunt voice recognition? My wife and I are like all married couples at breakfast. We do not speak to each other.

Mozilla releases dataset and model to lower voice-recognition barriers


Mozilla has released its Common Voice collection, which contains almost 400,000 recordings from 20,000 people, and is claimed to be the second-largest voice dataset publicly available.

Graduates!! Get paid for building a new voice recognition app!


If you're a Machine Learning graduate and can speak the language of your chosen location plus English, this is for you! Don't miss these fantastic opportunities; apply today.

Voice recognition and machine learning make service bots better


We are on the cusp of a technological revolution whereby increasingly sophisticated tasks can be handed over from humans to machines. Organizations are embracing advancements in artificial intelligence, robotics, and natural language technology to adopt platforms that can "learn" from experience and actually interact with users. The next wave of these chatbots will have enhanced real-time data analytics and automation capabilities and the ability to integrate intelligence across multiple digital channels to engage customers in natural conversations using voice or text. When you have a question about a product or service, you will be presented with the best agent, who possesses the entire company's collective experience and a huge wealth of knowledge to address your issue. Think about what happens today when you call your bank or the help desk of an ecommerce site.