Systems that can classify a person's emotion from their voice and facial tics alone are a longstanding goal of some AI researchers. Firms like Affectiva, which recently launched a product that scans drivers' faces and voices to monitor their mood, are moving the needle in the right direction. But considerable challenges remain, owing to nuances in speech and muscle movements. Researchers at the University of Science and Technology of China in Hefei claim to have made progress, though. In a paper published on the preprint server Arxiv.org this week ("Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition"), they describe an AI system that can recognize a person's emotional state with state-of-the-art accuracy on a popular benchmark.
Researchers have combined speech and facial recognition data to improve the emotion detection abilities of AI systems. The ability to recognise emotions is a longstanding goal of AI researchers. Accurate recognition enables things such as detecting tiredness at the wheel, anger which could lead to a crime being committed, or perhaps even signs of sadness or depression at suicide hotspots. Nuances in how people speak and move their facial muscles to express moods have presented a challenge. In a paper on Arxiv, researchers at the University of Science and Technology of China in Hefei describe the progress they have made.
Automatic emotion recognition (AER) is a challenging task because emotion is an abstract concept with multiple expressions. Although there is no consensus on a definition, human emotional states can usually be perceived by the auditory and visual systems. Inspired by this cognitive process in human beings, it is natural to utilize audio and visual information simultaneously in AER. However, most traditional fusion approaches only build a linear paradigm, such as feature concatenation and multi-system fusion, which can hardly capture the complex associations between audio and video. In this paper, we introduce factorized bilinear pooling (FBP) to deeply integrate the features of audio and video. Specifically, the features are selected through an embedded attention mechanism from the respective modalities to obtain the emotion-related regions. The whole pipeline can be completed in a neural network. Validated on the AFEW database of the audio-video sub-challenge in EmotiW2018, the proposed approach achieves an accuracy of 62.48%, outperforming the state-of-the-art result.
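The fusion idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common formulation of factorized bilinear pooling, in which each modality is projected by a low-rank matrix, the projections are multiplied element-wise, and the result is sum-pooled in groups of k, followed by power and L2 normalisation. The attention step is a plain dot-product softmax over frames, chosen here only for illustration. All names (`attention_pool`, `factorized_bilinear_pool`, `U`, `V`, `k`, `o`, the query vectors) are hypothetical, and random weights stand in for parameters that a real network would learn.

```python
import math
import random


def attention_pool(frames, query):
    """Collapse a sequence of frame vectors into one vector using softmax weights.

    Assumption: scores are dot products with a query vector; in a trained
    model the query would be learned.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]


def project(vec, mat):
    """Compute mat^T @ vec, where mat is a list of len(vec) rows."""
    return [sum(vec[r] * mat[r][c] for r in range(len(vec))) for c in range(len(mat[0]))]


def factorized_bilinear_pool(audio, video, U, V, k):
    """Fuse two feature vectors into one vector of length (columns of U) / k."""
    # Element-wise product of the two low-rank projections.
    joint = [a * v for a, v in zip(project(audio, U), project(video, V))]
    # Sum-pool over non-overlapping groups of k entries.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Signed square root (power normalisation), then L2 normalisation.
    powered = [math.copysign(math.sqrt(abs(z)), z) for z in pooled]
    norm = math.sqrt(sum(z * z for z in powered)) or 1.0
    return [z / norm for z in powered]


def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
audio_frames = random_matrix(5, 4, rng)  # 5 audio frames, 4 features each
video_frames = random_matrix(3, 6, rng)  # 3 video frames, 6 features each
audio_query = [rng.uniform(-1.0, 1.0) for _ in range(4)]
video_query = [rng.uniform(-1.0, 1.0) for _ in range(6)]

o, k = 3, 2  # output size and factor count (illustrative values)
U = random_matrix(4, o * k, rng)
V = random_matrix(6, o * k, rng)

audio_vec = attention_pool(audio_frames, audio_query)
video_vec = attention_pool(video_frames, video_query)
fused = factorized_bilinear_pool(audio_vec, video_vec, U, V, k)
print(len(fused), fused)
```

Unlike concatenation, every entry of the fused vector depends on products of audio and video features, which is the kind of cross-modal interaction a purely linear fusion cannot express. In a full system the fused vector would feed a classifier over the emotion categories.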
Emotion recognition has a pivotal role in affective computing and in human-computer interaction. Current technological developments increase the possibilities of collecting data about the emotional state of a person. In general, human perception of the emotion transmitted by a subject is based on vocal and visual information collected in the first seconds of interaction with the subject. As a consequence, the integration of verbal (i.e., speech) and non-verbal (i.e., image) information seems to be the preferred choice in most current approaches to emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality. We show that our proposed method outperforms other methods from the literature as well as human accuracy ratings. The experiments are conducted on the open-access multimodal dataset CREMA-D.
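The abstract does not give details of the windowing, so the following is only a rough sketch of what "a temporal window with different temporal offsets for each modality" could look like. It assumes each modality is a list of per-frame feature vectors sampled at its own rate, and it uses mean pooling plus concatenation as a stand-in for whatever fusion the paper actually applies. The function names, parameters, rates, and offsets are all hypothetical.

```python
def window(frames, rate_hz, start_s, duration_s):
    """Return the frames that fall in [start_s, start_s + duration_s)."""
    first = max(0, int(round(start_s * rate_hz)))
    last = min(len(frames), first + int(round(duration_s * rate_hz)))
    return frames[first:last]


def mean_vector(frames):
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def fuse_with_offsets(audio, video, audio_rate, video_rate,
                      start_s, duration_s, audio_offset_s, video_offset_s):
    """Cut one window per modality, each shifted by its own offset, then fuse.

    Assumption: fusion here is mean pooling followed by concatenation,
    used only as a placeholder.
    """
    a = window(audio, audio_rate, start_s + audio_offset_s, duration_s)
    v = window(video, video_rate, start_s + video_offset_s, duration_s)
    if not a or not v:
        raise ValueError("window falls outside the available frames")
    return mean_vector(a) + mean_vector(v)


# Toy data: 10 s of 1-dimensional audio features at 10 Hz and
# 10 s of 2-dimensional video features at 5 Hz.
audio = [[float(i)] for i in range(100)]
video = [[float(i), 0.0] for i in range(50)]

features = fuse_with_offsets(audio, video, audio_rate=10, video_rate=5,
                             start_s=1.0, duration_s=2.0,
                             audio_offset_s=0.0, video_offset_s=0.4)
print(features)
```

Here the video window starts 0.4 s later than the audio window, so the two modalities cover slightly different stretches of the same clip before being combined.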
In this work, we analyze the happiness levels of countries using an unbiased emotion detector, artificial intelligence (AI). To date, researchers have proposed many factors that may affect happiness, such as wealth, health and safety. Even though these factors all seem relevant, there is no clear consensus among sociologists on how to interpret them, and the models used to estimate the cost of these utilities include some assumptions. Researchers in the social sciences have been working to determine happiness levels in society and to explore the factors correlated with them through polls and various statistical methods. In our work, by using artificial intelligence, we introduce a different and relatively unbiased approach to this problem. By using AI, we make no assumption about what makes a person happy, and leave it to the AI to detect emotions from the faces of people in publicly available street footage. We analyzed the happiness levels in eight different cities around the world through footage available on the Internet and found that there is no statistically significant difference between countries in terms of happiness.