Why Is Speech Recognition Technology So Difficult to Perfect?

Huffington Post - Tech news and opinion

This is an excellent question to start off an automatic speech recognition (ASR) interview. I would slightly rephrase it as "Why is speech recognition hard?" ASR is just like any other machine learning (ML) problem: the objective is to classify a sound wave into one of the basic units of speech (also called a "class" in ML terminology), such as a word. The problem with human speech is the huge amount of variation that occurs when pronouncing a word. For example, below are two recordings of the word "Yes" spoken by the same person (wave source: AN4 dataset [1]).
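That variation is easy to see in code. Below is a minimal sketch (not part of the original answer) that loads two hypothetical AN4 recordings of "Yes", extracts MFCC features with librosa, and aligns them with dynamic time warping: even for the same speaker saying the same word, the frame counts and alignment cost differ. The file names are placeholders.

```python
# Minimal sketch: quantify how much two utterances of the same word differ,
# using MFCC features and dynamic time warping (DTW).
# The file names below are hypothetical stand-ins for two AN4 recordings of "Yes".
import librosa
import numpy as np

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load a waveform and return its MFCC matrix of shape (n_mfcc, frames)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

a = mfcc_features("an4_yes_take1.wav")   # hypothetical file
b = mfcc_features("an4_yes_take2.wav")   # hypothetical file

# DTW aligns the two feature sequences despite differences in speaking rate;
# the accumulated cost is a rough measure of how dissimilar the recordings are.
D, wp = librosa.sequence.dtw(X=a, Y=b, metric="euclidean")
print(f"Frames: {a.shape[1]} vs {b.shape[1]}, DTW cost: {D[-1, -1]:.1f}")
```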


How we learned to talk to computers, and how they learned to answer back - TechRepublic

#artificialintelligence

Remember the famous scene in Stanley Kubrick's 1968 film 2001: A Space Odyssey, when HAL 9000, the intelligent-turned-malevolent computer, regresses to his "childhood" and sings "Daisy Bell" as he's decommissioned by astronaut Dave Bowman? Its inspiration was a real-life Bell Labs demonstration of speech synthesis on an IBM 704 mainframe in 1961, witnessed by Arthur C Clarke, who later incorporated it into his 2001 novel and screenplay. Although Bell Labs' involvement in the field stretches back to the 1930s with Homer Dudley's keyboard-and-footpedal-driven Voder speech synthesis device, it's undoubtedly the classic Kubrick/Clarke movie that cemented the ideas of artificial intelligence (AI) and conversing with computers in the public mind. Depending on how old you are, you'll be familiar with computerised voices thanks to devices like Texas Instruments' popular 1978 Speak & Spell educational toy, Stephen Hawking's speech synthesiser (memorably sampled in the Pink Floyd song "Keep Talking"), GPS navigation systems in your car, and any number of public information and call-handling systems. More recently, the combination of automatic speech recognition (ASR), natural-language understanding (NLU) and text-to-speech (TTS) has come to mainstream attention in virtual assistants such as Apple's Siri, Google Now, Microsoft's Cortana, and Amazon's Alexa. To get a handle on how speech technologies work, we clearly need to know something about the mechanics of human speech and the structure of language. When we speak, air from the lungs passes through the vocal tract to produce "voiced" or "unvoiced" sounds (depending on whether the vocal cords are vibrating or not) that may then be modulated by the tongue, teeth and lips.
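The voiced/unvoiced distinction mentioned above can be illustrated with a classic short-time analysis: voiced frames (vibrating vocal cords) tend to show high energy and a low zero-crossing rate, while unvoiced frames show the opposite. The sketch below is a rough illustration with hypothetical thresholds, not anything taken from the article.

```python
# Rough sketch of voiced/unvoiced frame labelling using short-time energy and
# zero-crossing rate. Frame sizes and thresholds are illustrative assumptions.
import numpy as np

def voiced_unvoiced(signal, frame_len=400, hop=160,
                    energy_thresh=1e-3, zcr_thresh=0.25):
    """Label each frame of a mono waveform (float samples) as voiced or unvoiced."""
    labels = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
        labels.append("voiced" if energy > energy_thresh and zcr < zcr_thresh
                      else "unvoiced")
    return labels

# Example on a synthetic signal: a 120 Hz tone (voiced-like) followed by noise.
sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
tone = 0.3 * np.sin(2 * np.pi * 120 * t)
noise = 0.05 * np.random.randn(sr // 2)
print(voiced_unvoiced(np.concatenate([tone, noise]))[:5])
```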


Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech

arXiv.org Machine Learning

Unlike conventional techniques for speaker adaptation, which attempt to improve the accuracy of the segmentation using acoustic models that are more robust to the speaker's characteristics, we aim to use only context-dependent characteristics extrapolated with linguistic analysis techniques. In simple terms, we use the intuitive idea that context-dependent information is tightly correlated with the related acoustic waveform. We propose a statistical model which predicts correcting values to reduce the systematic error produced by a state-of-the-art Hidden Markov Model (HMM) based speech segmentation. In other words, we can predict how HMM-based Automatic Speech Recognition (ASR) systems interpret the waveform signal, determining the systematic error in different contextual scenarios. Our approach consists of two phases: (1) identifying context-dependent phonetic unit classes (for instance, the class which identifies vowels as the nucleus of monosyllabic words); and (2) building a regression model that associates the mean error value made by the ASR during the segmentation of a single-speaker corpus with each class. The success of the approach is evaluated by comparing the corrected unit boundaries and the state-of-the-art HMM segmentation against a reference alignment, which is assumed to be the optimal solution. The results of this study show that the context-dependent correction of unit boundaries has a positive influence on the forced alignment, especially when the misinterpretation of the phone is driven by acoustic properties linked to the speaker's phonetic characteristics. In conclusion, our work supplies a first analysis of a model that is sensitive to speaker-dependent characteristics, robust to defective and noisy information, and very simple to implement, and which could be utilized as an alternative either to more expensive speaker-adaptation systems or to numerous manual correction sessions.
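To make the two-phase idea concrete, here is a deliberately simplified sketch rather than the paper's implementation: assuming a development set provides both HMM and reference boundary times per context-dependent class, it learns each class's mean signed error and applies it as a correction, a bare-bones stand-in for the regression model the abstract describes. All class names and times below are hypothetical.

```python
# Simplified per-class boundary correction: learn the mean signed error of
# HMM forced-alignment boundaries against a reference, then shift new
# boundaries by their class's learned error. Data is hypothetical.
from collections import defaultdict
import numpy as np

def fit_corrections(dev_boundaries):
    """dev_boundaries: list of (context_class, hmm_time_s, reference_time_s)."""
    errors = defaultdict(list)
    for cls, hmm_t, ref_t in dev_boundaries:
        errors[cls].append(hmm_t - ref_t)          # systematic error per class
    return {cls: float(np.mean(e)) for cls, e in errors.items()}

def correct(boundary_time, context_class, corrections):
    """Shift an HMM boundary by the learned mean error of its class."""
    return boundary_time - corrections.get(context_class, 0.0)

dev = [("vowel_nucleus_monosyllable", 1.230, 1.210),
       ("vowel_nucleus_monosyllable", 2.455, 2.440),
       ("fricative_word_final",       3.100, 3.112)]
corrections = fit_corrections(dev)
print(correct(4.020, "vowel_nucleus_monosyllable", corrections))
```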


Speech Recognition with no speech or with noisy speech

arXiv.org Machine Learning

The performance of automatic speech recognition systems (ASR) degrades in the presence of noisy speech. This paper demonstrates that using electroencephalography (EEG) can help automatic speech recognition systems overcome performance loss in the presence of noise. The paper also shows that distillation training of automatic speech recognition systems using EEG features will increase their performance. Finally, we demonstrate the ability to recognize words from EEG with no speech signal on a limited English vocabulary with high accuracy.
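The abstract does not spell out the distillation setup, so the sketch below shows only the generic knowledge-distillation objective it presumably builds on: a hypothetical student that sees noisy speech plus EEG features is trained against both hard labels and the softened outputs of a clean-speech teacher. Every model, dimension, and hyperparameter here is a placeholder, not the paper's recipe.

```python
# Generic knowledge-distillation sketch (assumed setup, not the paper's):
# the student consumes acoustic + EEG features, the teacher clean acoustics only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-target KL divergence."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage: 8 frames, 40-dim acoustic + 16-dim EEG features, 30 output classes.
student = nn.Linear(40 + 16, 30)   # placeholder models; real ones would be deeper
teacher = nn.Linear(40, 30)
acoustic, eeg = torch.randn(8, 40), torch.randn(8, 16)
targets = torch.randint(0, 30, (8,))
with torch.no_grad():
    t_logits = teacher(acoustic)                     # teacher sees clean speech only
s_logits = student(torch.cat([acoustic, eeg], dim=-1))
loss = distillation_loss(s_logits, t_logits, targets)
loss.backward()
```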