
Facial movement


Morpheus: A Neural-driven Animatronic Face with Hybrid Actuation and Diverse Emotion Control

Zhang, Zongzheng, Yang, Jiawen, Peng, Ziqiao, Yang, Meng, Ma, Jianzhu, Cheng, Lin, Xu, Huazhe, Zhao, Hang, Zhao, Hao

arXiv.org Artificial Intelligence

Figure captions: Blue markers indicate the attachment points between the underlying mechanical structure and the soft skin, while yellow arrows denote the directions of movement. Blue arrows indicate the three-axis neck movement: nodding, shaking, and rotation. The green arrow illustrates the jaw's ability to move horizontally in addition to its typical opening and closing motions, enabling more diverse expressions. The first row illustrates the virtual expressions generated by our algorithm rendered in Blender, while the second row displays the corresponding real-world expressions reproduced by the animatronic face.

Abstract -- Previous animatronic faces struggle to express emotions effectively due to hardware and software limitations. On the hardware side, earlier approaches either use rigid-driven mechanisms, which provide precise control but are difficult to design within constrained spaces, or tendon-driven mechanisms, which are more space-efficient but challenging to control. In contrast, we propose a hybrid actuation approach that combines the best of both worlds. The eyes and mouth--key areas for emotional expression--are controlled using rigid mechanisms for precise movement, while the nose and cheeks, which convey subtle facial micro-expressions, are driven by strings. This design allows us to build a compact yet versatile hardware platform capable of expressing a wide range of emotions. On the algorithmic side, our method introduces a self-modeling network that maps motor actions to facial landmarks, allowing us to automatically establish the relationship between blendshape coefficients for different facial expressions and the corresponding motor control signals through gradient backpropagation. We then train a neural network to map speech input to the corresponding blendshape controls.
With our method, we can generate distinct emotional expressions such as happiness, fear, disgust, and anger from any given sentence, each with nuanced, emotion-specific control signals--a feature that has not been demonstrated in earlier systems.
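The core algorithmic idea above -- a differentiable forward model from motor commands to facial landmarks, inverted by gradient backpropagation to recover control signals for a target expression -- can be sketched in miniature. This is not the Morpheus network; it stands in a toy linear model for the learned self-model, and all names and values are invented for illustration.

```python
# Toy sketch of self-model inversion: a (here, linear) forward model maps
# motor commands to facial landmarks; target landmark offsets for an
# expression are turned into motor signals by gradient descent.

def forward(W, motors):
    # landmarks[i] = sum_j W[i][j] * motors[j]
    return [sum(w * m for w, m in zip(row, motors)) for row in W]

def solve_motors(W, target, steps=500, lr=0.05):
    motors = [0.0] * len(W[0])
    for _ in range(steps):
        pred = forward(W, motors)
        err = [p - t for p, t in zip(pred, target)]
        # gradient of 0.5 * ||W m - t||^2 w.r.t. m is W^T err
        grad = [sum(W[i][j] * err[i] for i in range(len(W)))
                for j in range(len(motors))]
        motors = [m - lr * g for m, g in zip(motors, grad)]
    return motors

W = [[1.0, 0.2], [0.1, 0.8]]   # toy motor-to-landmark Jacobian
target = [0.6, 0.4]            # landmark offsets for a desired expression
m = solve_motors(W, target)
residual = max(abs(p - t) for p, t in zip(forward(W, m), target))
```

In the real system the forward model is a neural network learned from the robot's own motion data, but the same backpropagation-through-the-model loop applies.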


ExFace: Expressive Facial Control for Humanoid Robots with Diffusion Transformers and Bootstrap Training

Zhang, Dong, Peng, Jingwei, Jiao, Yuyang, Gu, Jiayuan, Yu, Jingyi, Chen, Jiahao

arXiv.org Artificial Intelligence

This paper presents a novel Expressive Facial Control (ExFace) method based on Diffusion Transformers, which achieves precise mapping from human facial blendshapes to bionic robot motor control. By incorporating an innovative model bootstrap training strategy, our approach not only generates high-quality facial expressions but also significantly improves accuracy and smoothness. Experimental results demonstrate that the proposed method outperforms previous methods in terms of accuracy, frames per second (FPS), and response time. Furthermore, we develop the ExFace dataset, driven by human facial data. ExFace shows excellent real-time performance and natural expression rendering in applications such as robot performances and human-robot interaction, offering a new solution for bionic robot interaction. Facial expressions are integral to human communication, playing a pivotal role in the transmission of emotions, attitudes, and intentions. As evidenced in prior research, individuals rely on a variety of facial expressions to both convey and interpret affective states [1].
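For context on the mapping problem ExFace learns, here is a minimal hand-written baseline -- not the diffusion-transformer method itself -- that converts a stream of blendshape coefficients into motor commands with a slew-rate limit for smoothness. The servo range and step limit are invented for illustration.

```python
# Hypothetical baseline for blendshape-to-motor mapping: linear calibration
# into a servo pulse-width range, plus a slew-rate limit so commands change
# smoothly between frames.

def blendshape_to_motor(coef, lo=500, hi=2500):
    # map a coefficient in [0, 1] to a pulse width in microseconds
    return lo + coef * (hi - lo)

def smooth_stream(coeffs, max_step=200.0):
    out, prev = [], None
    for c in coeffs:
        target = blendshape_to_motor(c)
        if prev is None:
            prev = target
        else:
            # clip the per-frame change to +/- max_step microseconds
            delta = max(-max_step, min(max_step, target - prev))
            prev = prev + delta
        out.append(prev)
    return out

cmds = smooth_stream([0.0, 1.0, 1.0, 0.2])  # sudden jumps get rate-limited
```

A learned model such as ExFace replaces this fixed rule with a mapping trained on paired human/robot data, which is what allows it to improve both accuracy and smoothness at once rather than trading one for the other.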


Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Chen, Hejia, Zhang, Haoxian, Zhang, Shoulong, Liu, Xiaoqiang, Zhuang, Sisi, Zhang, Yuan, Wan, Pengfei, Zhang, Di, Li, Shuai

arXiv.org Artificial Intelligence

Speech-driven 3D talking face methods should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences, limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions makes it challenging to achieve satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs) without degrading speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based classifier-free guidance (CFG) technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector with text-AU alignment is introduced to enable natural-language user input and further support multimodal control. Extensive experimental results show that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness, and user studies confirm wide acceptance of its fine-grained control. Project page: https://harryxd2018.github.io/cafe-talk/
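The mask-based CFG idea mentioned above can be illustrated with a small sketch. This is a generic masked classifier-free-guidance blend, not Cafe-Talk's exact formulation: the fine-grained condition steers the prediction only where a mask is active, and `scale` sets the intensity of the control.

```python
# Hypothetical masked classifier-free guidance: blend an unconditional
# prediction toward a conditional one, but only at positions where the
# mask is 1; `scale` controls how strongly the condition is applied.

def masked_cfg(uncond, cond, mask, scale=2.0):
    # per element: keep uncond where mask == 0, guide toward cond elsewhere
    return [u + scale * m * (c - u)
            for u, c, m in zip(uncond, cond, mask)]

# only the middle position receives the fine-grained control
out = masked_cfg([0.1, 0.1, 0.1], [0.5, 0.5, 0.5], [0, 1, 0], scale=1.5)
```

Varying `scale` per AU is one natural way to regulate the occurrence (mask) and intensity (scale) of control independently, which matches the role the abstract assigns to this technique.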


Creepy robot toddler can mimic human expressions

Popular Science

Bipedal robots (at least some of them) are becoming increasingly agile and humanlike in their movements. Despite this, one physical aspect remains stuck in the uncanny valley--realistic facial expressions. One solution, however, may be found by treating expressions as an interplay between various "waveforms." The result is a new "dynamic arousal expression" system developed by researchers at Osaka University that allows a bot to mimic expressions more quickly and seamlessly than its predecessors. The potential solution, detailed in a study published in the Journal of Robotics and Mechatronics, requires first classifying various facial gestures like yawning, blinking, and breathing as individual waveforms.
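The waveform framing described above lends itself to a toy illustration: each gesture is a periodic signal, and the face's state at any moment is their superposition, modulated by an arousal parameter. The gesture names, periods, and weights below are invented for illustration, not taken from the Osaka University study.

```python
import math

# Toy superposition of gesture waveforms: breathing as a slow sinusoid,
# blinking as a brief periodic pulse; arousal scales the breathing depth.

def breathing(t, period=4.0):
    return 0.5 * (1 + math.sin(2 * math.pi * t / period))

def blinking(t, period=5.0, width=0.15):
    # a short closed-eye pulse once per period
    return 1.0 if (t % period) < width else 0.0

def face_state(t, arousal=0.5):
    return {
        "chest": arousal * breathing(t),
        "eyelid": 1.0 - blinking(t),   # 1 = open, 0 = closed
    }

s = face_state(0.05)   # falls inside the blink pulse: eyelid closed
```

Because each gesture is continuous in time, transitions between expressions come out smooth by construction, which is the property the article credits for escaping the start-stop look of earlier animatronics.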


3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy

Sha, Xuanmeng, Zhang, Liyun, Mashita, Tomohiro, Uranishi, Yuki

arXiv.org Artificial Intelligence

Audio-driven 3D facial animation has made impressive progress in both research and application development. The newest approaches focus on Transformer-based and diffusion-based methods; however, there is still a gap in vividness and emotional expression between the generated animation and a real human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. Instead of generating a face for every frame, this method produces variable and realistic human facial movements by predicting the 3D vertex trajectory on a 3D facial template with a diffusion policy. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, preserving the continuous and natural flow of human emotion. Experiments show that our approach is effective at synthesizing variable and dynamic facial motion.
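The policy framing -- predict a short trajectory from the current observation, execute part of it, then re-plan -- is the receding-horizon pattern common to diffusion policies. The sketch below illustrates that control loop with a stand-in linear rule in place of a trained diffusion model; all names and numbers are invented.

```python
# Toy receding-horizon rollout: a "policy" predicts a chunk of future
# vertex displacements from the current observation, the animation executes
# the first few, then re-plans from the new state.

def toy_policy(obs, horizon=4):
    vertex, target = obs
    # stand-in for a diffusion policy: ease toward the audio-driven target
    return [(target - vertex) * (i + 1) / horizon for i in range(horizon)]

def rollout(vertex, target, steps=3, execute=2):
    trajectory = [vertex]
    for _ in range(steps):
        chunk = toy_policy((vertex, target))
        for d in chunk[:execute]:   # execute only part of the plan,
            vertex = vertex + d     # then replan from the new state
            trajectory.append(vertex)
    return trajectory

traj = rollout(0.0, 1.0)   # vertex position approaches the target smoothly
```

Predicting chunks rather than single frames is what lets the motion stay temporally coherent: each new plan is conditioned on where the face actually is, not on an independently generated previous frame.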


Dogs have lost their ability to convey facial expressions compared with their wolf ancestors due to domestication, study says

Daily Mail - Science & tech

Most dog owners will insist they can tell what their pooch is thinking from their face alone. But man's best friend used to be even more expressive, according to a new study. Researchers have discovered that the domestication process has resulted in the loss of some communication abilities in today's dogs compared to their wolf ancestors. The team, from Durham University, used a 'Dog Facial Action Coding System' to analyse video recordings of captive wolves and domestic dogs, during both spontaneous social interactions and reactions to external stimuli, for example a squeaky toy.


DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer

Ma, Zhiyuan, Zhu, Xiangyu, Qi, Guojun, Qian, Chen, Zhang, Zhaoxiang, Lei, Zhen

arXiv.org Artificial Intelligence

Speech-driven 3D facial animation is important for many multimedia applications. Recent work has shown promise in using either Diffusion models or Transformer architectures for this task. However, their mere aggregation does not lead to improved performance. We suspect this is due to a shortage of paired audio-4D data, which is crucial for the Transformer to effectively perform as a denoiser within the Diffusion framework. To tackle this issue, we present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules. These modules serve as substitutes for the traditional self/cross-attention in standard Transformers, incorporating thoughtfully designed biases that steer the attention mechanisms to concentrate on both the relevant task-specific and diffusion-related conditions. We also explore the trade-off between accurate lip synchronization and non-verbal facial expressions within the Diffusion paradigm. Experiments show our model not only achieves state-of-the-art performance on existing benchmarks but also fast inference, owing to its ability to generate facial motions in parallel.
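The general mechanism behind "biased" attention -- adding a designed offset to the attention logits so queries concentrate on condition positions -- can be shown in a few lines. This is a generic additive-bias attention sketch, not DiffSpeaker's exact module; the bias values are invented.

```python
import math

# Scaled dot-product attention with an additive logit bias: a large bias at
# a position pulls attention weight toward it (e.g. the audio frame aligned
# with the current motion frame), regardless of the raw query-key score.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def biased_attention(q, keys, values, bias):
    d = len(q)
    logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + b
              for k, b in zip(keys, bias)]
    weights = softmax(logits)
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights

# with a strong bias on position 1, attention concentrates there
out, w = biased_attention([1.0, 0.0],
                          [[1.0, 0.0], [0.0, 1.0]],
                          [[0.0], [1.0]],
                          bias=[0.0, 5.0])
```

In a data-scarce regime, hard-wiring such structure into the bias is a way to give the denoiser alignment cues it would otherwise need much more paired audio-4D data to learn.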


A Closer Look at Geometric Temporal Dynamics for Face Anti-Spoofing

Chang, Chih-Jung, Lee, Yaw-Chern, Yao, Shih-Hsuan, Chen, Min-Hung, Wang, Chien-Yi, Lai, Shang-Hong, Chen, Trista Pei-Chun

arXiv.org Artificial Intelligence

Face anti-spoofing (FAS) is indispensable for a face recognition system. Many texture-driven countermeasures have been developed against presentation attacks (PAs), but their performance against unseen domains or unseen spoofing types is still unsatisfactory. Instead of exhaustively collecting all the spoofing variations and making binary live/spoof decisions, we offer a new perspective on the FAS task: distinguishing between normal and abnormal movements of live and spoof presentations. We propose the Geometry-Aware Interaction Network (GAIN), which exploits dense facial landmarks with a spatio-temporal graph convolutional network (ST-GCN) to establish a more interpretable and modularized FAS model. Additionally, with our cross-attention feature interaction mechanism, GAIN can be easily integrated with other existing methods to significantly boost performance. Our approach achieves state-of-the-art performance in the standard intra- and cross-dataset evaluations. Moreover, our model outperforms state-of-the-art methods by a large margin in the cross-dataset cross-type protocol on CASIA-SURF 3DMask (+10.26% higher AUC score), exhibiting strong robustness against domain shifts and unseen spoofing types.
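To make the ST-GCN ingredient concrete, here is a single spatial graph-convolution step over a tiny landmark graph -- a generic mean-aggregation sketch, not GAIN's learned layers; the graph and features are invented. Stacking such spatial steps with temporal convolutions over frame sequences is what lets the model reason about movement dynamics.

```python
# One spatial graph-convolution step on facial landmarks: each landmark's
# feature (here, a motion channel) is averaged with its graph neighbours'.

def graph_conv(features, adjacency):
    n = len(features)
    out = []
    for i in range(n):
        neigh = [j for j in range(n) if adjacency[i][j]] + [i]  # include self
        out.append([sum(features[j][d] for j in neigh) / len(neigh)
                    for d in range(len(features[0]))])
    return out

# 3 landmarks in a chain 0 -- 1 -- 2, one feature channel (x-velocity)
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[0.0], [3.0], [0.0]]
smoothed = graph_conv(feats, adj)
```

Operating on landmark motion rather than raw texture is what gives the approach its robustness angle: a printed photo or 3D mask can mimic appearance far more easily than the geometry of natural facial movement.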


Emotion AI: Can artificial intelligence really read humans? - disruptor.news

#artificialintelligence

The human questioner may not be sure "I'm good" is a fact. But artificial intelligence (AI) and machine learning (ML) engineers claim that new technologies known as "emotion AI" can observe people and accurately assess how they're feeling. AI is all around us, whether we know it or not. It enables mainstream social media platforms to pitch smart personalization, virtual healthcare assistants to help nurses with burnout prevention, integrated smart assistants in electronic devices to perform various tasks, and much more. Artificial emotional intelligence systems go further.


A virtual reality-based method for examining audiovisual prosody perception

Meister, Hartmut, Winter, Isa Samira, Waechtler, Moritz, Sandmann, Pascale, Abdellatif, Khaled

arXiv.org Artificial Intelligence

Prosody plays a vital role in verbal communication. Acoustic cues of prosody have been examined extensively. However, prosodic characteristics are not only perceived auditorily, but also visually based on head and facial movements. The purpose of this report is to present a method for examining audiovisual prosody using virtual reality. We show that animations based on a virtual human provide motion cues similar to those obtained from video recordings of a real talker. The use of virtual reality opens up new avenues for examining multimodal effects of verbal communication. We discuss the method in the framework of examining prosody perception in cochlear implant listeners.