Body Language


Classification of User Satisfaction in HRI with Social Signals in the Wild

Schiffmann, Michael, Jeschke, Sabina, Richert, Anja

arXiv.org Artificial Intelligence

Socially interactive agents (SIAs) are being used in various scenarios and are nearing productive deployment. Evaluating user satisfaction with SIAs' performance is a key factor in designing the interaction between the user and the SIA. Currently, subjective user satisfaction is primarily assessed manually through questionnaires or indirectly via system metrics. This study examines the automatic classification of user satisfaction through analysis of social signals, aiming to enhance both manual and autonomous evaluation methods for SIAs. During a field trial at the Deutsches Museum Bonn, a Furhat Robotics head was employed as a service and information hub, collecting an "in-the-wild" dataset. This dataset comprises 46 single-user interactions, including questionnaire responses and video data. Our method focuses on automatically classifying user satisfaction based on time series classification, using time series of social-signal metrics derived from body pose, facial expressions, and physical distance. This study compares three feature engineering approaches across different machine learning models. The results confirm the method's effectiveness in reliably identifying interactions with low user satisfaction without the need for manually annotated datasets. This approach offers significant potential for enhancing SIA performance and user experience through automated feedback mechanisms.
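The general pipeline described here (per-interaction time series of social-signal metrics, feature engineering, a standard classifier) can be illustrated with a minimal sketch. The summary-statistic features, the random-forest model, and the synthetic data below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact pipeline): flag low-satisfaction interactions
# from per-interaction time series of social-signal metrics using simple
# summary-statistic features and a standard scikit-learn classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def summarize(series: np.ndarray) -> np.ndarray:
    """Collapse one (T, n_signals) time series into fixed-size per-signal features."""
    return np.concatenate([
        series.mean(axis=0),      # average level of each signal
        series.std(axis=0),       # variability over the interaction
        np.ptp(series, axis=0),   # range (max - min)
    ])

def build_dataset(interactions, labels):
    """interactions: list of (T_i, n_signals) arrays (pose, expression, distance metrics);
    labels: 1 = low satisfaction, 0 = otherwise (hypothetical encoding)."""
    X = np.stack([summarize(ts) for ts in interactions])
    y = np.asarray(labels)
    return X, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for the 46 field-trial interactions.
    interactions = [rng.normal(size=(rng.integers(50, 200), 5)) for _ in range(46)]
    labels = rng.integers(0, 2, size=46)
    X, y = build_dataset(interactions, labels)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```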


AI's Impact on Mental Health

Communications of the ACM

There is no doubt artificial intelligence (AI) has the potential to improve access to mental health care. "One could imagine a world where AI serves as the 'front line' for mental health, providing a clearinghouse of resources and available services for individuals seeking help," wrote the authors of the 2023 article "The Potential Influence of AI on Population Mental Health." Targeted interventions delivered digitally through chatbots "can help reduce the population burden of mental illness, particularly in hard-to-reach populations and contexts, for example, through stepped care approaches that aim to help populations with the highest risk following natural disasters," the article states. Besides Nomi, there are an increasing number of AI platforms people are using to create chatbots to take on several roles, including that of ad hoc therapist. Yet, while AI can assist in mental health management, it cannot replace human intuition. A trained therapist observes nuances that AI can't, such as body language, tone shifts, and unspoken emotions. Chatbots can be helpful, but mental health experts stress that they should never fully replace the human experience. That said, these mainstream chatbots are frequently being used for therapeutic purposes, as opposed to chatbots designed with mental health management in mind. Industry observers say the reasons are many: They provide emotional support when people are not ready to reach out to a therapist. They are anonymous, easy to use, convenient, available anytime, safe, judgment-free, affordable, and fast. These general-purpose chatbots help by providing comfort, validation, and a safe space for users to express themselves--all without the stigma that sometimes comes with traditional therapy settings. "Talking to a therapist can be intimidating, expensive, or complicated to access, and sometimes you need someone--or something--to listen at that exact moment," said Stephanie Lewis, a licensed clinical social worker and executive director of Epiphany Wellness addiction and mental health treatment centers.


How Age Influences the Interpretation of Emotional Body Language in Humanoid Robots -- long paper version

Consoli, Ilaria, Mattutino, Claudio, Gena, Cristina, de Carolis, Berardina, Palestra, Giuseppe

arXiv.org Artificial Intelligence

There is a general consensus that body movements and postures provide important cues for identifying emotional states, particularly when facial and vocal signals are unavailable [1]. Emotional Body Language (EBL) is rapidly emerging as a significant area of research within cognitive and affective neuroscience. According to De Gelder [10], numerous valuable insights into human emotion and its neurobiological foundations have been derived from the study of facial expressions. Indeed, certain emotions are more effectively conveyed through facial expressions, while others are better communicated through body movements or a combination of both. Gestures provide observable cues that can be instrumental in recognizing and interpreting a user's emotional state, especially in the absence of verbal or facial signals.


Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues

Kim, Youngmin, Chung, Jiwan, Kim, Jisoo, Lee, Sunghyun, Lee, Sangkyu, Kim, Junhyeok, Yang, Cheoljong, Yu, Youngjae

arXiv.org Artificial Intelligence

Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS dataset, we validate its substantial scale and effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal cues that correspond to the conversational input.
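The core idea of combining text with vector-quantized nonverbal representations under one next-token objective can be sketched roughly as follows; the codebook, vocabulary sizes, and feature dimensions are illustrative assumptions, not the MARS/VENUS implementation.

```python
# Hedged sketch of the general idea: quantize continuous nonverbal features against a
# codebook and interleave the resulting discrete codes with text token IDs so a single
# next-token objective can cover both modalities.
import numpy as np

VOCAB_TEXT = 32_000      # hypothetical text vocabulary size
CODEBOOK_SIZE = 512      # hypothetical nonverbal codebook size

def vector_quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature frame (e.g. a facial-expression or pose vector) to the
    index of its nearest codebook entry (standard VQ assignment step)."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def interleave(text_ids: np.ndarray, nonverbal_codes: np.ndarray) -> np.ndarray:
    """Place nonverbal codes in a disjoint ID range above the text vocabulary and
    append them to the utterance, yielding one token stream for LM training."""
    return np.concatenate([text_ids, nonverbal_codes + VOCAB_TEXT])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(CODEBOOK_SIZE, 64))   # illustrative learned codebook
    frames = rng.normal(size=(12, 64))                # 12 time-aligned nonverbal frames
    text_ids = rng.integers(0, VOCAB_TEXT, size=20)   # tokenized utterance
    seq = interleave(text_ids, vector_quantize(frames, codebook))
    print(seq.shape, seq.min(), seq.max())            # targets for next-token prediction
```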


Augmented Body Communicator: Enhancing daily body expression for people with upper limb limitations through LLM and a robotic arm

Zhou, Songchen, Armstrong, Mark, Barbareschi, Giulia, Ajioka, Toshihiro, Hu, Zheng, Ando, Ryoichi, Yoshifuji, Kentaro, Muto, Masatane, Minamizawa, Kouta

arXiv.org Artificial Intelligence

Individuals with upper limb movement limitations face challenges in interacting with others. Although robotic arms are currently used primarily for functional tasks, there is considerable potential to explore ways to enhance users' body language capabilities during social interactions. This paper introduces an Augmented Body Communicator system that integrates robotic arms and a large language model. Through the incorporation of kinetic memory, disabled users and their supporters can collaboratively design actions for the robot arm. The LLM system then suggests the most suitable action based on contextual cues during interactions. The system underwent thorough user testing with six participants who have conditions affecting upper limb mobility. Results indicate that the system improves users' ability to express themselves. Based on our findings, we offer recommendations for developing robotic arms that support disabled individuals with both body language expression and functional tasks.
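The selection step (an LLM choosing a pre-recorded gesture from the kinetic memory given conversational context) might look roughly like the sketch below; the gesture library, prompt wording, and query_llm placeholder are assumptions for illustration, not the paper's system.

```python
# Illustrative sketch only: select a user-authored robot-arm gesture from a
# "kinetic memory" library based on the current conversational context.
from dataclasses import dataclass

@dataclass
class Gesture:
    name: str
    description: str   # natural-language description authored by the user/supporter

KINETIC_MEMORY = [
    Gesture("wave_hello", "raise the arm and wave side to side in greeting"),
    Gesture("thumbs_up", "rotate the wrist and extend the thumb to signal agreement"),
    Gesture("point_screen", "extend the arm toward the shared screen"),
]

def query_llm(prompt: str) -> str:
    """Placeholder: a real deployment would call an LLM here and parse its choice."""
    return KINETIC_MEMORY[0].name   # trivial stand-in so the sketch runs

def suggest_gesture(context: str) -> Gesture:
    options = "\n".join(f"- {g.name}: {g.description}" for g in KINETIC_MEMORY)
    prompt = (
        "Conversation context:\n"
        f"{context}\n\n"
        "Choose the single most suitable gesture from:\n"
        f"{options}\n"
        "Answer with the gesture name only."
    )
    choice = query_llm(prompt)
    return next(g for g in KINETIC_MEMORY if g.name == choice)

if __name__ == "__main__":
    print(suggest_gesture("A friend just said hello as they entered the room.").name)
```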


'It's a new world': the analysts using AI to psychologically profile elite players

The Guardian

Listen to any pundit's post-match reaction and you will hear variations of that soundbite. But can you analyse an athlete's state of mind, based on their on-pitch body language? In an era when football is increasingly leaning on data to demonstrate physical attributes, statistics offering an accurate indication of a player's psychological qualities, such as emotional control and leadership, are harder to come by. But Premier League clubs including Brighton are using a technique intended to help in that regard with selection and recruitment. Thomas Tuchel made headlines by telling England's players to communicate more after he evaluated their interactions during the final of Euro 2024, but counting the number of times players gesture or talk to each other on the pitch tells only part of the mental battle being played out.


AI avatar generator Synthesia does video footage deal with Shutterstock

The Guardian

A $2bn (£1.6bn) British startup that uses artificial intelligence to generate realistic avatars has struck a licensing deal with the stock footage firm Shutterstock to help develop its technology. Synthesia will pay the US-based Shutterstock an undisclosed sum to use its library of corporate video footage to train its latest AI model. It expects that incorporating the clips into its model will produce even more realistic expressions, vocal tones and body language from the avatars. "Thanks to this partnership with Shutterstock, we hope to try out new approaches that will … increase the realism and expressiveness of our AI generated avatars, bringing them closer to human-like performances," said Synthesia. Synthesia uses human actors to generate digital avatars of people, which are then deployed by companies in corporate videos in a range of scenarios such as advising on cybersecurity, calculating water bills and how to communicate better at work.


BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation

Fu, Yumeng, Wu, Junjie, Wang, Zhongjie, Zhang, Meishan, Wu, Yulin, Liu, Bingquan

arXiv.org Artificial Intelligence

Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics, but ignore the significance of video-derived behavior information. Unlike text and audio inputs, video rich in facial expressions, body language, and posture provides emotion-trigger signals that help models make more accurate emotion predictions. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) to incorporate the speaker's behaviors, including subtle facial micro-expressions, body language, and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction tuning strategy to extend the model to the conversation scenario for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms state-of-the-art methods on two benchmark datasets, and we provide a detailed discussion of the significance of video-derived behavior information in MERC.
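One way to picture the behavior-aware instruction tuning is a training example that pairs each utterance with a video-derived behavior caption and targets the emotion label. The field names, emotion set, and prompt wording below are hypothetical, not BeMERC's actual data format.

```python
# Hedged illustration: format one conversation turn as an instruction-tuning example
# that combines the utterance with a behavior caption, targeting the emotion label.
EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]

def build_example(utterance: str, behavior_caption: str, label: str) -> dict:
    assert label in EMOTIONS
    instruction = (
        "You are given one utterance from a conversation together with a description "
        "of the speaker's facial micro-expressions, body language and posture. "
        f"Predict the speaker's emotion from: {', '.join(EMOTIONS)}."
    )
    model_input = f"Utterance: {utterance}\nObserved behavior: {behavior_caption}"
    return {"instruction": instruction, "input": model_input, "output": label}

if __name__ == "__main__":
    ex = build_example(
        "I can't believe you did that without telling me.",
        "leans forward, jaw tightens, arms cross over the chest",
        "anger",
    )
    print(ex["input"], "->", ex["output"])
```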


SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Jiang, Jianping, Xiao, Weiye, Lin, Zhengyu, Zhang, Huaizhong, Ren, Tianxiang, Gao, Yang, Lin, Zhiqian, Cai, Zhongang, Yang, Lei, Liu, Ziwei

arXiv.org Artificial Intelligence

Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand, and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal responses (speech and motion) based on the user's multimodal input to drive the character for social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.
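At the interface level, the unified social VLA idea is a single model mapping the user's multimodal turn to a multimodal character response. The types, shapes, and stub response below are illustrative assumptions only, not SOLAMI's API.

```python
# Rough interface sketch under assumptions: one function maps the user's multimodal
# input (speech + body motion) to a unified multimodal response (speech + motion).
from dataclasses import dataclass
from typing import List

@dataclass
class MultimodalTurn:
    speech_tokens: List[int]           # tokenized speech for the turn
    motion_frames: List[List[float]]   # per-frame body-motion features (illustrative)

def social_vla_respond(user_turn: MultimodalTurn) -> MultimodalTurn:
    """Stand-in for the end-to-end model: returns fixed placeholder speech and an
    idle motion. A real model would decode both streams from the user input."""
    reply_speech = [101, 2054, 2024]                # placeholder token IDs
    idle_motion = [[0.0] * 6 for _ in range(30)]    # 30 frames of a neutral pose
    return MultimodalTurn(reply_speech, idle_motion)

if __name__ == "__main__":
    user = MultimodalTurn([7592], [[0.1] * 6 for _ in range(30)])
    reply = social_vla_respond(user)
    print(len(reply.speech_tokens), len(reply.motion_frames))
```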


Love in Action: Gamifying Public Video Cameras for Fostering Social Relationships in Real World

Zhang, Zhang, Li, Da, Wu, Geng, Li, Yaoning, Sun, Xiaobing, Wang, Liang

arXiv.org Artificial Intelligence

In this paper, we create "Love in Action" (LIA), a body-language-based social game that uses video cameras installed in public spaces to enhance social relationships in the real world. In the game, participants assume dual roles: requesters, who issue social requests, and performers, who respond to those requests by performing specified body language. To mediate communication between participants, we build an AI-enhanced video analysis system incorporating multiple visual analysis modules, such as person detection, attribute recognition, and action recognition, to assess the performer's body-language quality. A two-week field study involving 27 participants shows significant improvements in their social friendships, as indicated by self-reported questionnaires. Moreover, user experiences are investigated to highlight the potential of public video cameras as a novel communication medium for socializing in public spaces.
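The assessment pipeline (person detection, attribute recognition, action recognition feeding a body-language quality score) can be outlined roughly as below; the module stubs and frame-fraction scoring rule are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the general pipeline idea: chain detection, attribute recognition
# and action recognition over camera frames, then score whether the requested body
# language was performed.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    bbox: tuple        # (x1, y1, x2, y2) in pixels
    attributes: dict   # e.g. {"upper_color": "red"} from attribute recognition
    action: str        # label from the action-recognition module

def detect_people(frame) -> List[Detection]:
    """Placeholder for real person-detection, attribute, and action-recognition models."""
    return [Detection((10, 20, 110, 300), {"upper_color": "red"}, "waving")]

def score_performance(frames, requested_action: str) -> float:
    """Fraction of frames in which some detected person performs the requested action."""
    hits = 0
    for frame in frames:
        if any(d.action == requested_action for d in detect_people(frame)):
            hits += 1
    return hits / max(len(frames), 1)

if __name__ == "__main__":
    fake_frames = [object()] * 30   # stand-in for 30 video frames
    print("body-language quality:", score_performance(fake_frames, "waving"))
```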