non-verbal behavior
Mitigation of gender bias in automatic facial non-verbal behaviors generation
Delbosc, Alice, Ochs, Magalie, Sabouret, Nicolas, Ravenet, Brian, Ayache, Stéphane
Research on non-verbal behavior generation for socially interactive agents focuses mainly on the believability and synchronization of non-verbal cues with speech. However, existing models, predominantly based on deep learning architectures, often perpetuate biases inherent in the training data. This raises ethical concerns, depending on the intended application of these agents. This paper addresses these issues by first examining the influence of gender on facial non-verbal behaviors. We concentrate on gaze, head movements, and facial expressions. We introduce a classifier capable of discerning the gender of a speaker from their non-verbal cues. This classifier achieves high accuracy on both real behavior data, extracted using state-of-the-art tools, and synthetic data, generated from a model developed in previous work. Building upon this work, we present a new model, FairGenderGen, which integrates a gender discriminator and a gradient reversal layer into our previous behavior generation model. This new model generates facial non-verbal behaviors from speech features while mitigating gender sensitivity in the generated behaviors. Our experiments demonstrate that the classifier developed in the initial phase is no longer able to distinguish the gender of the speaker from the generated non-verbal behaviors.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > Costa Rica > San José Province > San José (0.05)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
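The gradient reversal trick at the core of FairGenderGen fits in a few lines. Below is a minimal PyTorch sketch of the general technique (a gender discriminator trained through a gradient reversal layer); the class names, layer sizes, and lambda value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a gradient reversal layer (GRL) feeding a gender
# discriminator, the general debiasing technique described above.
# All names, sizes, and the lambda value are illustrative assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class GenderDiscriminator(nn.Module):
    """Predicts speaker gender from the generator's hidden behavior features."""
    def __init__(self, feat_dim=128, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # logit for binary gender prediction
        )

    def forward(self, features):
        # Gradients flowing back through this call are sign-flipped, so
        # minimizing the discriminator loss *maximizes* gender confusion
        # upstream in the behavior generator.
        return self.net(GradReverse.apply(features, self.lambd))
```

Because the reversal layer flips gradient signs, the same update that improves the discriminator's gender prediction pushes the generator's internal features toward gender invariance.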
Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement
Funk, Marius, Okada, Shogo, André, Elisabeth
Non-verbal behavior is a central challenge in understanding the dynamics of a conversation and the affective states between interlocutors arising from the interaction. Although psychological research has demonstrated that non-verbal behaviors vary across cultures, limited computational analysis has been conducted to clarify these differences and assess their impact on engagement recognition. To gain a greater understanding of engagement and non-verbal behaviors across a wide range of cultures and language spheres, in this study we conduct a multilingual computational analysis of non-verbal features and investigate their role in engagement and its prediction. To achieve this goal, we first expanded the NoXi dataset, which contains interaction data from participants living in France, Germany, and the United Kingdom, by collecting session data of dyadic conversations in Japanese and Chinese, resulting in the enhanced dataset NoXi+J. Next, we extracted multimodal non-verbal features, including speech acoustics, facial expressions, backchanneling, and gestures, via various pattern recognition techniques and algorithms. Then, we conducted a statistical analysis of listening behaviors and backchannel patterns to identify culturally dependent and independent features in each language and common features among multiple languages. These features were also correlated with the engagement shown by the interlocutors. Finally, we analyzed the influence of cultural differences on the input features of LSTM models trained to predict engagement for five language datasets. A SHAP analysis combined with transfer learning confirmed a considerable correlation between the importance of input features for a language set and the significant cultural characteristics identified in our analysis.
- Europe > Germany (0.25)
- Europe > France (0.24)
- North America > Costa Rica > San José Province > San José (0.05)
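As an illustration of the prediction-plus-attribution pipeline described above, here is a hedged sketch of an LSTM engagement regressor over per-frame multimodal features analyzed with SHAP. The feature counts, sequence lengths, and model sizes are assumptions, not values from the paper.

```python
# Illustrative sketch: LSTM engagement regressor + SHAP feature attribution.
# Feature counts, shapes, and names are assumptions.
import numpy as np
import shap
import torch
import torch.nn as nn

class EngagementLSTM(nn.Module):
    def __init__(self, n_features=20, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)     # continuous engagement score

    def forward(self, x):                    # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])         # predict from the last time step

model = EngagementLSTM()
background = torch.randn(50, 100, 20)        # stand-in for training sequences
test_seqs = torch.randn(5, 100, 20)

explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(test_seqs)
sv = shap_values[0] if isinstance(shap_values, list) else shap_values

# Averaging |SHAP| over samples and time yields one global importance score
# per input feature, the kind of ranking compared across language datasets.
importance = np.abs(sv).mean(axis=(0, 1))
```

Comparing such per-feature importance vectors across models trained on different language datasets is one way to connect feature attribution with the cultural differences found in the statistical analysis.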
Towards the generation of synchronized and believable non-verbal facial behaviors of a talking virtual agent
Delbosc, Alice, Ochs, Magalie, Sabouret, Nicolas, Ravenet, Brian, Ayache, Stéphane
This paper introduces a new model to generate rhythmically relevant non-verbal facial behaviors for virtual agents while they speak. In terms of synchronization with speech and believability, the model's perceived performance is comparable to that of behaviors directly extracted from the data and replayed on a virtual agent. Interestingly, we found that training the model with two different datasets, instead of one, did not necessarily improve its performance; the expressiveness of the people in the dataset and the shooting conditions are key factors. We also show that employing an adversarial model, in which fabricated fake examples are introduced during the training phase, increases the perceived synchronization with speech. A collection of videos demonstrating the results, along with the code, can be accessed at: https://github.com/aldelb/non_verbal_facial_animation.
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia (0.04)
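The adversarial scheme described in this abstract can be sketched as a standard conditional GAN-style training step: a discriminator is shown real (speech, behavior) pairs and fabricated pairs from the generator, and the generator learns to fool it. The architectures, feature sizes, and function names below are assumptions, not the released code.

```python
# Hedged sketch of adversarial training with fabricated fake examples.
# Architectures and feature sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, d_speech=26, d_behavior=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_speech, 64), nn.ReLU(),
                                 nn.Linear(64, d_behavior))
    def forward(self, speech):
        return self.net(speech)

class Discriminator(nn.Module):
    def __init__(self, d_speech=26, d_behavior=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_speech + d_behavior, 64),
                                 nn.ReLU(), nn.Linear(64, 1))
    def forward(self, speech, behavior):
        return self.net(torch.cat([speech, behavior], dim=-1))

bce = nn.BCEWithLogitsLoss()

def adversarial_step(gen, disc, opt_g, opt_d, speech, real_behavior):
    ones = torch.ones(speech.size(0), 1)
    zeros = torch.zeros(speech.size(0), 1)

    # Discriminator update: real pairs -> 1, fabricated fake pairs -> 0.
    fake = gen(speech).detach()
    d_loss = bce(disc(speech, real_behavior), ones) + \
             bce(disc(speech, fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: make generated behavior pass as real for this speech.
    g_loss = bce(disc(speech, gen(speech)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Conditioning the discriminator on the speech features, rather than on the behavior alone, is what ties the adversarial signal to speech-behavior synchronization.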
Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module
Liu, Yihe, Yuan, Ziqi, Mao, Huisheng, Liang, Zhiyun, Yang, Wanqiuyue, Qiu, Yuanzhe, Cheng, Tie, Li, Xiaoteng, Xu, Hua, Gao, Kai
Multimodal sentiment analysis (MSA), which aims to improve text-based sentiment analysis with associated acoustic and visual modalities, is an emerging research area due to its potential applications in Human-Computer Interaction (HCI). However, existing research observes that the acoustic and visual modalities contribute much less than the textual modality, a phenomenon termed text-predominance. Under such circumstances, in this work, we emphasize making non-verbal cues matter for the MSA task. First, from the resource perspective, we present the CH-SIMS v2.0 dataset, an extension and enhancement of CH-SIMS. Compared with the original dataset, CH-SIMS v2.0 doubles its size with another 2121 refined video segments with both unimodal and multimodal annotations, and collects 10161 unlabelled raw video segments with rich acoustic and visual emotion-bearing context to highlight non-verbal cues for sentiment prediction. Second, from the model perspective, benefiting from the unimodal annotations and the unsupervised data in CH-SIMS v2.0, we propose the Acoustic Visual Mixup Consistent (AV-MC) framework. The designed modality mixup module can be regarded as an augmentation that mixes the acoustic and visual modalities from different videos. By drawing unobserved multimodal contexts along with the text, the model learns to be aware of different non-verbal contexts for sentiment prediction. Our evaluations demonstrate that both CH-SIMS v2.0 and the AV-MC framework enable further research into discovering emotion-bearing acoustic and visual cues and pave the way toward interpretable end-to-end HCI applications for real-world scenarios.
- Asia > India > Karnataka > Bengaluru (0.06)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > New York > New York County > New York City (0.04)
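The modality mixup idea behind AV-MC is easy to sketch: acoustic and visual features from different videos are convexly combined while the text stream stays untouched, exposing the model to unobserved non-verbal contexts. The feature sizes and the Beta distribution parameter below are assumptions, not the paper's settings.

```python
# Minimal sketch of acoustic-visual modality mixup in the spirit of AV-MC.
# Feature sizes and the Beta(alpha, alpha) parameter are assumptions.
import torch

def av_mixup(acoustic, visual, alpha=0.5):
    """acoustic: (batch, d_a), visual: (batch, d_v) unimodal feature batches."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(acoustic.size(0))         # partner videos to mix with
    mixed_a = lam * acoustic + (1 - lam) * acoustic[perm]
    mixed_v = lam * visual + (1 - lam) * visual[perm]
    return mixed_a, mixed_v, perm, lam

# Usage: mix only the non-verbal streams; the unimodal sentiment labels can be
# interpolated with the same lam to supervise the mixed examples.
acoustic, visual = torch.randn(8, 74), torch.randn(8, 35)
mixed_a, mixed_v, perm, lam = av_mixup(acoustic, visual)
```

Keeping the text stream fixed while perturbing the non-verbal streams is precisely what forces the model to attend to acoustic and visual context instead of relying on text alone.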
Automatic Thoughts and Facial Expressions in Cognitive Restructuring with Virtual Agents
Cognitive restructuring is a well-established mental health technique for amending automatic thoughts, which are distorted and biased beliefs about a situation, into objective and balanced thoughts. Since virtual agents can be used anytime and anywhere, they are expected to perform cognitive restructuring without being influenced by medical infrastructure or patients' stigma toward mental illness. Unfortunately, since the quantitative analysis of human-agent interaction is still insufficient, the effect on the user's cognitive state remains unclear. We collected interaction data between virtual agents and users to observe the mood improvements associated with changes in automatic thoughts that occur in user cognition and addressed the following two points: (1) implementation of a virtual agent that helps a user identify and evaluate automatic thoughts; (2) identification of the relationship between a user's facial expressions and the extent of the mood improvement subjectively felt by users during the human-agent interaction. We focus on these points because cognitive restructuring by a human therapist starts by identifying automatic thoughts and seeking sufficient evidence to find balanced thoughts (evaluation of automatic thoughts). Therapists also use such non-verbal behaviors as facial expressions to detect changes in a user's mood, which is an important indicator for guidance. Based on the results of this analysis, we provide a technical guidance framework that fully ...
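The second analysis point above, relating facial expressions to subjectively felt mood improvement, amounts to a per-session correlation study. A hedged sketch follows; the action units, the feature extractor, and the random data are hypothetical placeholders, not the study's materials.

```python
# Hedged sketch: correlating per-session facial expression features with
# subjective mood improvement. AUs, extractor, and data are hypothetical.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
au_features = rng.random((40, 3))    # mean AU06/AU12/AU04 intensity per session
mood_improvement = rng.random(40)    # post-session minus pre-session mood rating

for name, column in zip(["AU06", "AU12", "AU04"], au_features.T):
    r, p = pearsonr(column, mood_improvement)
    print(f"{name}: r={r:+.2f}, p={p:.3f}")
```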
Q&A: How AI tech can make humans more emotionally intelligent (Includes interview)
Cogito, an MIT spin-off, uses voice-based AI to analyze behavioral and vocal cues (such as pitch, tone, pace, etc.) to provide in-the-moment feedback during conversations, guiding individuals to be more emotionally intelligent and perceptive. Cogito's emotional intelligence technology is being used by large insurance organizations like MetLife, Humana, and Cigna to enhance employee productivity and improve human emotional intelligence during customer service calls. To discover more, Digital Journal spoke with Dr. John Kane of Cogito. Digital Journal: How sophisticated is AI becoming in general? Dr. John Kane: Artificial intelligence continues to have a transformational effect across industries.
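Cogito's pipeline is proprietary, but the kinds of vocal cues mentioned above (pitch, tone, pace) can be approximated with standard audio analysis. The sketch below uses librosa; the file name, frequency range, and pace proxy are generic assumptions, not Cogito's method.

```python
# Illustrative sketch of extracting generic vocal cues (pitch, loudness, pace).
# The file name, frequency range, and pace proxy are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("call_snippet.wav", sr=16000)   # hypothetical recording

# Pitch: fundamental frequency track via probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
mean_pitch = np.nanmean(f0)

# Loudness proxy: mean short-time RMS energy.
loudness = librosa.feature.rms(y=y).mean()

# Pace proxy: acoustic onset events per second of audio.
onsets = librosa.onset.onset_detect(y=y, sr=sr)
pace = len(onsets) / (len(y) / sr)

print(f"pitch={mean_pitch:.1f} Hz, rms={loudness:.4f}, pace={pace:.2f} onsets/s")
```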