The one and only reason why businesses are turning to automatic emotion detection is you! Emotion sensing technologies are expanding exponentially: market researchers estimate that the Emotion Detection & Recognition (EDR) business will grow at a compound annual growth rate (CAGR) of 27.2–39.9%. One of the most common ways to automatically recognize emotions is via facial detection in photos and videos, and the list of software packages and APIs that let you do this keeps getting longer.
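As a rough illustration of how such a pipeline typically starts, here is a minimal sketch that locates faces with OpenCV's bundled Haar cascade; the emotion classifier that would run on each detected face is left as a placeholder, since the specific model depends on whichever library or API you pick.

```python
# Minimal face-detection front end (a sketch, not any vendor's API):
# locate faces with OpenCV's bundled Haar cascade, then hand each crop
# to whatever emotion classifier the chosen library provides.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def detect_faces(image_path):
    """Return cropped face regions found in the image."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [image[y:y + h, x:x + w] for (x, y, w, h) in boxes]

# Each crop would then go to an emotion model, e.g.:
# for face in detect_faces("photo.jpg"):
#     label = emotion_model.predict(face)   # hypothetical classifier
```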
A long-sought goal for many artificial intelligence (AI) researchers is the development of a system that can identify human emotion from voice and facial expressions. While some facial scanning technology is available, there is still a long way to go in properly identifying emotional states, given the complexity of the nuances in speech and in facial muscle movement. Researchers at the University of Science and Technology of China in Hefei believe that they have made a breakthrough. Their paper, "Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition," describes an AI system that recognizes human emotion with state-of-the-art accuracy on a popular benchmark. In the paper, the researchers write, "Automatic emotion recognition (AER) is a challenging task due to the abstract concept and multiple expressions of emotion. Inspired by this cognitive process in human beings, it's natural to simultaneously utilize audio and visual information in AER … The whole pipeline can be completed in a neural network."
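The paper's title points to factorized bilinear pooling as the fusion mechanism. The sketch below shows a generic factorized bilinear pooling layer that combines an audio feature vector with a video feature vector in PyTorch; the dimensions, the attention modules, and the backbone networks from the actual paper are omitted, so treat this as an illustration of the pooling idea rather than the authors' implementation.

```python
# A generic factorized bilinear pooling (FBP) fusion layer in PyTorch.
# It approximates a bilinear interaction between an audio vector and a
# video vector with two low-rank projections, sum pooling, a signed
# square root, and L2 normalization. Dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    def __init__(self, audio_dim=1582, video_dim=512, fused_dim=256, rank=4):
        super().__init__()
        self.rank = rank
        self.fused_dim = fused_dim
        self.proj_audio = nn.Linear(audio_dim, fused_dim * rank)
        self.proj_video = nn.Linear(video_dim, fused_dim * rank)

    def forward(self, audio_feat, video_feat):
        # Element-wise product of the two low-rank projections.
        joint = self.proj_audio(audio_feat) * self.proj_video(video_feat)
        # Sum-pool over the rank dimension.
        joint = joint.view(-1, self.fused_dim, self.rank).sum(dim=2)
        # Signed square root and L2 normalization, as is standard for FBP.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint, dim=1)

# Usage: fuse per-clip audio and face features, then classify emotions.
fusion = FactorizedBilinearPooling()
classifier = nn.Linear(256, 7)                 # e.g. 7 emotion classes
logits = classifier(fusion(torch.randn(8, 1582), torch.randn(8, 512)))
```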
While several approaches to the face emotion recognition task have been proposed in the literature, none of them reports the power consumption or inference time required to run the system in an embedded environment. Without this knowledge it is not clear whether we are actually able to provide accurate face emotion recognition in an embedded environment and, if not, how far we are from making it feasible and what the biggest bottlenecks are. The main goal of this paper is to answer these questions and to argue that power consumption and inference time should be reported alongside detection accuracy, since the real usability of the proposed systems and their adoption in human-computer interaction strongly depend on them. In this paper, we identify the state-of-the-art face emotion recognition methods that are potentially suitable for an embedded environment and the most frequently used datasets for this task. Our study shows that most of the reported experiments use datasets with posed expressions or a particular experimental setup with special conditions for image collection. Since our goal is to evaluate the performance of the identified promising methods in a realistic scenario, we collect a new dataset with non-exaggerated emotions and use it, in addition to the publicly available datasets, to evaluate detection accuracy, power consumption, and inference time on three frequently used embedded devices with different computational capabilities. Our results show that grayscale images are still more suitable for the embedded environment than color ones and that, for most of the analyzed systems, either inference time or energy consumption, or both, are limiting factors for adoption in real-life embedded applications.
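To make the kind of measurement this abstract calls for concrete, the sketch below times the forward pass of a face emotion model on single images. The model, the 48×48 grayscale input size, and the benchmarking parameters are placeholders of my own; measuring energy would additionally require a power meter or the target board's own counters, which is outside what a few lines of Python can show.

```python
# A minimal sketch of benchmarking per-image inference time, e.g. on an
# embedded board. Model and input shape are stand-ins, not the paper's.
import time
import torch

def benchmark(model, input_shape=(1, 1, 48, 48), warmup=10, runs=100):
    """Return mean per-image inference time in milliseconds."""
    model.eval()
    x = torch.randn(*input_shape)          # assumed 48x48 grayscale face crop
    with torch.no_grad():
        for _ in range(warmup):            # warm-up to stabilize caches/clocks
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs

# Example with a stand-in network; a real study would load each candidate
# face emotion model and repeat this on every target device.
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(16, 7),
)
print(f"{benchmark(net):.2f} ms per image")
```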
Researchers have combined speech and facial recognition data to improve the emotion detection abilities of AIs. The ability to recognise emotions is a longstanding goal of AI researchers. Accurate recognition would enable applications such as detecting tiredness at the wheel, anger that could lead to a crime being committed, or perhaps even signs of sadness or depression at suicide hotspots. Nuances in how people speak and move their facial muscles to express moods have presented a challenge. Researchers at the University of Science and Technology of China in Hefei have made some progress, detailed in a paper (PDF) on arXiv.
In this work, we train fully convolutional networks to detect anger in speech. Since training these deep architectures requires large amounts of data and the size of emotion datasets is relatively small, we use transfer learning. However, unlike previous approaches that use speech or emotion-based tasks for the source model, we instead use SoundNet, a fully convolutional neural network trained multimodally on a massive video dataset to classify audio, with ground-truth labels provided by vision-based classifiers. As a result of transfer learning from SoundNet, our trained anger detection model improves performance and generalizes well on a variety of acted, elicited, and natural emotional speech datasets. We also test the cross-lingual effectiveness of our model by evaluating our English-trained model on Mandarin Chinese speech emotion data. Furthermore, our proposed system has low latency suitable for real-time applications, only requiring 1.2 seconds of audio to make a reliable classification.
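As a rough sketch of the transfer-learning recipe described here, the code below freezes a pretrained fully convolutional audio encoder and attaches a small head for binary anger classification. The encoder architecture, the checkpoint path, the 16 kHz sampling rate, and the layer sizes are illustrative assumptions standing in for SoundNet, not the authors' released code.

```python
# A sketch of transfer learning from a pretrained fully convolutional
# audio network to an anger detector. The encoder stands in for SoundNet;
# checkpoint path, sampling rate, and layer sizes are assumptions.
import torch
import torch.nn as nn

class ConvAudioEncoder(nn.Module):
    """Stand-in for a SoundNet-style 1-D fully convolutional encoder."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=2, padding=32), nn.ReLU(),
            nn.MaxPool1d(8),
            nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=16), nn.ReLU(),
            nn.MaxPool1d(8),
            nn.Conv1d(32, 64, kernel_size=16, stride=2, padding=8), nn.ReLU(),
        )

    def forward(self, waveform):
        return self.features(waveform)

encoder = ConvAudioEncoder()
# encoder.load_state_dict(torch.load("soundnet_pretrained.pt"))  # hypothetical checkpoint
for p in encoder.parameters():            # freeze the pretrained layers
    p.requires_grad = False

# Fully convolutional head: global pooling keeps the model input-length
# agnostic, so ~1.2 s of 16 kHz audio (about 19200 samples) works directly.
head = nn.Sequential(
    nn.Conv1d(64, 2, kernel_size=1),      # 2 classes: anger / not anger
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
)

waveform = torch.randn(4, 1, 19200)       # batch of ~1.2 s clips (assumed rate)
logits = head(encoder(waveform))          # fine-tune only the head's parameters
```

In this setup only the head is trained on the small emotion datasets, which is one common way to realize the "transfer from a large multimodal source model" idea the abstract describes.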