Nishida, Shin'ya
HAPI: A Model for Learning Robot Facial Expressions from Human Preferences
Yang, Dongsheng, Liu, Qianying, Sato, Wataru, Minato, Takashi, Liu, Chaoran, Nishida, Shin'ya
Automatic robotic facial expression generation is crucial for human-robot interaction, as handcrafted methods based on fixed joint configurations often yield rigid and unnatural behaviors. Although recent automated techniques reduce the need for manual tuning, they tend to fall short by not adequately bridging the gap between human preferences and model predictions-resulting in a deficiency of nuanced and realistic expressions due to limited degrees of freedom and insufficient perceptual integration. In this work, we propose a novel learning-to-rank framework that leverages human feedback to address this discrepancy and enhanced the expressiveness of robotic faces. Specifically, we conduct pairwise comparison annotations to collect human preference data and develop the Human Affective Pairwise Impressions (HAPI) model, a Siamese RankNet-based approach that refines expression evaluation. Results obtained via Bayesian Optimization and online expression survey on a 35-DOF android platform demonstrate that our approach produces significantly more realistic and socially resonant expressions of Anger, Happiness, and Surprise than those generated by baseline and expert-designed methods. This confirms that our framework effectively bridges the gap between human preferences and model predictions while robustly aligning robotic expression generation with human affective responses.
Machine Learning Modeling for Multi-order Human Visual Motion Processing
Sun, Zitang, Chen, Yen-Ju, Yang, Yung-Hao, Li, Yuan, Nishida, Shin'ya
Our research aims to develop machines that learn to perceive visual motion as do humans. While recent advances in computer vision (CV) have enabled DNN-based models to accurately estimate optical flow in naturalistic images, a significant disparity remains between CV models and the biological visual system in both architecture and behavior. This disparity includes humans' ability to perceive the motion of higher-order image features (second-order motion), which many CV models fail to capture because of their reliance on the intensity conservation law. Our model architecture mimics the cortical V1-MT motion processing pathway, utilizing a trainable motion energy sensor bank and a recurrent graph network. Supervised learning employing diverse naturalistic videos allows the model to replicate psychophysical and physiological findings about first-order (luminance-based) motion perception. For second-order motion, inspired by neuroscientific findings, the model includes an additional sensing pathway with nonlinear preprocessing before motion energy sensing, implemented using a simple multilayer 3D CNN block. When exploring how the brain acquired the ability to perceive second-order motion in natural environments, in which pure second-order signals are rare, we hypothesized that second-order mechanisms were critical when estimating robust object motion amidst optical fluctuations, such as highlights on glossy surfaces. We trained our dual-pathway model on novel motion datasets with varying material properties of moving objects. We found that training to estimate object motion from non-Lambertian materials naturally endowed the model with the capacity to perceive second-order motion, as can humans. The resulting model effectively aligns with biological systems while generalizing to both first- and second-order motion phenomena in natural scenes.
Modelling Human Visual Motion Processing with Trainable Motion Energy Sensing and a Self-attention Network
Sun, Zitang, Chen, Yen-Ju, Yang, Yung-hao, Nishida, Shin'ya
Visual motion processing is essential for humans to perceive and interact with dynamic environments. Despite extensive research in cognitive neuroscience, image-computable models that can extract informative motion flow from natural scenes in a manner consistent with human visual processing have yet to be established. Meanwhile, recent advancements in computer vision (CV), propelled by deep learning, have led to significant progress in optical flow estimation, a task closely related to motion perception. Here we propose an image-computable model of human motion perception by bridging the gap between biological and CV models. Specifically, we introduce a novel two-stages approach that combines trainable motion energy sensing with a recurrent self-attention network for adaptive motion integration and segregation. This model architecture aims to capture the computations in V1-MT, the core structure for motion perception in the biological visual system, while providing the ability to derive informative motion flow for a wide range of stimuli, including complex natural scenes. In silico neurophysiology reveals that our model's unit responses are similar to mammalian neural recordings regarding motion pooling and speed tuning. The proposed model can also replicate human responses to a range of stimuli examined in past psychophysical studies. The experimental results on the Sintel benchmark demonstrate that our model predicts human responses better than the ground truth, whereas the state-of-the-art CV models show the opposite. Our study provides a computational architecture consistent with human visual motion processing, although the physiological correspondence may not be exact.
Optimizing Facial Expressions of an Android Robot Effectively: a Bayesian Optimization Approach
Yang, Dongsheng, Sato, Wataru, Liu, Qianying, Minato, Takashi, Namba, Shushi, Nishida, Shin'ya
Expressing various facial emotions is an important social ability for efficient communication between humans. A key challenge in human-robot interaction research is providing androids with the ability to make various human-like facial expressions for efficient communication with humans. The android Nikola, we have developed, is equipped with many actuators for facial muscle control. While this enables Nikola to simulate various human expressions, it also complicates identification of the optimal parameters for producing desired expressions. Here, we propose a novel method that automatically optimizes the facial expressions of our android. We use a machine vision algorithm to evaluate the magnitudes of seven basic emotions, and employ the Bayesian Optimization algorithm to identify the parameters that produce the most convincing facial expressions. Evaluations by naive human participants demonstrate that our method improves the rated strength of the android's facial expressions of anger, disgust, sadness, and surprise compared with the previous method that relied on Ekman's theory and parameter adjustments by a human expert.