People use AI for a wide range of speech recognition and understanding tasks, from enabling smart speakers to developing tools for people who are hard of hearing or who have speech impairments. But oftentimes these speech understanding systems don't work well in the everyday situations when we need them most: Where multiple people are speaking simultaneously or when there's lots of background noise. Even sophisticated noise-suppression techniques are often no match for, say, the sound of the ocean during a family beach trip or the background chatter of a bustling street market. One reason why people can understand speech better than AI in these instances is that we use not just our ears but also our eyes. We might see someone's mouth moving and intuitively know the voice we're hearing must be coming from her, for example.
Abstract: The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem – unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) a'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television. The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available.
This paper describes the structure and operation of the Hearsay speech understanding system by the use of a specific example illustrating the various stages of recognition. The system consists of a set of cooperating independent processes, each representing a source of Knowledge. The knowledge is used either to predict what may appear in a given context or to verify hypotheses resulting from a prediction. The structure of the system is illustrated by considering its Operation in a particular task situation: Voice-Chess. The representation and use of various sources of knowledge are outlined. Preliminary results of the reduction in search resulting from the use of various sources of knowledge are given.See also: IEEE Transactions on Computers C-25:427-431.(1976).In IJCAI-73: THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 20-23 August 1973, Stanford University Stanford, California.
The study tested five publicly available tools from Apple, Amazon, Google, IBM and Microsoft that anyone can use to build speech recognition services. These tools are not necessarily what Apple uses to build Siri or Amazon uses to build Alexa. But they may share underlying technology and practices with services like Siri and Alexa. Each tool was tested last year, in late May and early June, and they may operate differently now. The study also points out that when the tools were tested, Apple's tool was set up differently from the others and required some additional engineering before it could be tested.