People use AI for a wide range of speech recognition and understanding tasks, from enabling smart speakers to developing tools for people who are hard of hearing or who have speech impairments. But these speech understanding systems often don't work well in the everyday situations where we need them most: when multiple people are speaking simultaneously or when there's lots of background noise. Even sophisticated noise-suppression techniques are often no match for, say, the sound of the ocean during a family beach trip or the background chatter of a bustling street market. One reason people understand speech better than AI in these situations is that we use not just our ears but also our eyes. We might see someone's mouth moving and intuitively know the voice we're hearing must be coming from her, for example.
"Although you took very thorough precautions in the pod against my hearing you, I could see your lips move." It is widely known that people perceive speech not just by listening with their ears but also by picking up cues from a speaker's mouth movements. Similarly, combining visual observation with audio could conceivably help a computer parse human speech better. In a sense, computer programs can read lips, though engineering that ability is laborious.
Meta AI released a self-supervised speech recognition model that also uses video and is reported to be up to 75% more accurate than prior state-of-the-art audio-visual models trained with the same amount of labeled data. The new model, Audio-Visual Hidden-Unit BERT (AV-HuBERT), uses audiovisual features to improve on models that rely on audio alone. Its visual features are based on lip movements, similar to how humans read lips. Lip-reading helps filter out background noise while someone is speaking, a task that is extremely hard using audio alone. To generate training targets, a pre-processing step extracts audio and visual features from video and clusters them with k-means; the cluster assignments then serve as discrete prediction targets.
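The clustering step described above can be sketched as follows. This is a minimal illustration, not Meta's actual pipeline: it uses synthetic stand-in vectors in place of real extracted audio/visual features, and the feature dimension, cluster count, and function name are all assumptions for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_pseudo_labels(features, n_clusters=8, seed=0):
    """Cluster per-frame feature vectors with k-means and return the
    cluster assignments, which can serve as discrete pseudo-labels
    for a HuBERT-style masked-prediction objective."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(features)

# Synthetic stand-in for per-frame features (e.g. audio MFCCs,
# possibly concatenated with lip-region embeddings).
rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 39))  # 500 frames, 39-dim features

labels = make_pseudo_labels(frames, n_clusters=8)
print(labels.shape)  # one discrete target per frame
```

Each frame is thereby mapped to one of a small number of discrete units, which is what lets a model learn by predicting the units of masked frames without any human transcriptions.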
Speech is the main channel of face-to-face communication, but understanding it involves a lot more than just listening to the words people say. Reading someone's lips can also be crucial, since it can help you parse the meaning of their words in situations where you can't hear them clearly, and that is something Meta seems to be taking into account with its AI. Many studies have shown that it is much harder to understand what someone is saying when you can't see how their mouth is moving. Meta has developed a new framework called AV-HuBERT that takes both factors into account, which could vastly improve its speech recognition capabilities, although it should be said that this is still at the research stage. Essentially, Meta is testing whether anything can be gained by allowing AI to read lips as well as listen to audio recordings.
Human beings have developed remarkable abilities to integrate information from various sensory sources, exploiting their inherent complementarity. Perceptual capabilities are thereby heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation amid a panoply of sound signals. This fusion ability is also key in refining the perception of sound-source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has identified the superior colliculus as the brain region responsible for this modality fusion, and a handful of biological models have been proposed to describe its underlying neurophysiological process. Drawing inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability has a wide range of applications, from teleconferencing systems to social robotics. The detection approach first routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates the spatial cross-mapping of unimodal perceptual fields across neurons. The validation process employed two publicly available datasets, with the achieved results meeting and substantially surpassing initial expectations.
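The general idea of fusing two unimodal embeddings through a cross-mapping of their perceptual fields can be sketched in a few lines. This is only an illustration of the concept, not the paper's superior-colliculus layer: the embedding sizes, the outer-product fusion, the random weights, and the threshold are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the outputs of the two specialized unimodal
# networks (dimensions are illustrative).
audio_emb = rng.normal(size=16)   # auditory embedding
visual_emb = rng.normal(size=16)  # visual embedding

def cross_map_fuse(a, v, w):
    """Fuse two unimodal embeddings via their outer product, so every
    audio component interacts with every visual component, then score
    the fused map with a weight matrix (random here, learned in a
    real system)."""
    fused = np.outer(a, v)            # pairwise audio-visual interactions
    return float(np.sum(w * fused))   # scalar active-speaker score

w = rng.normal(size=(16, 16))         # placeholder for learned weights
score = cross_map_fuse(audio_emb, visual_emb, w)
speaking = score > 0.0                # threshold is illustrative
print(score, speaking)
```

The outer product stands in for the idea of spatially cross-mapping the two modalities: each entry of the fused map couples one auditory component with one visual component, and the downstream weights decide which of those couplings signal an active speaker.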