 different emotion


WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

arXiv.org Artificial Intelligence

Speech emotion recognition (SER) remains a challenging yet crucial task due to the inherent complexity and diversity of human emotions. To address this problem, researchers attempt to fuse information from other modalities via multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of cross-modal interactions, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal speech emotion recognition framework that addresses critical research problems in effective multimodal fusion, heterogeneity among modalities, and discriminative representation learning. By leveraging a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion demonstrates improved performance over existing state-of-the-art methods on benchmark datasets. Our work highlights the importance of capturing nuanced cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results on two benchmark datasets (IEMOCAP and MELD) demonstrate that WavFusion outperforms state-of-the-art strategies on emotion recognition.
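The abstract names a gated cross-modal attention mechanism but does not spell it out. The following is only a minimal sketch of the general idea, not the WavFusion implementation: an audio feature vector attends over a sequence of features from another modality, and a sigmoid gate decides how much of the attended context is added back. The function names, the residual form, and the gate parameterisation (`gate_weights`, `gate_bias`) are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query over keys/values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def gated_fusion(audio_vec, other_seq, gate_weights, gate_bias=0.0):
    """Fuse an audio vector with context attended from another modality.

    Hypothetical gate: sigmoid(w . [audio; context] + b), a scalar in (0, 1)
    that scales how much cross-modal context is added to the audio vector.
    """
    context = cross_attention(audio_vec, other_seq, other_seq)
    logit = dot(gate_weights, audio_vec + context) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [a + gate * c for a, c in zip(audio_vec, context)]
    return fused, gate
```

With a strongly negative `gate_bias` the gate closes and the output stays close to the audio vector alone, which is the behaviour a gate is meant to provide when the other modality is uninformative.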


CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation

arXiv.org Artificial Intelligence

Speech-driven 3D facial animation technology has been developed for years, but its practical application still falls short of expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has seen many related studies, existing methods struggle to synthesize natural and realistic expressions, resulting in a mechanical and stiff appearance of facial animations. Even with some research extracting emotional features from speech, the randomness of facial movements limits the effective expression of emotions. To address this issue, this paper proposes a method called CSTalk (Correlation Supervised) that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the metahuman character model and capture a dataset for five different emotions. We train a generative network using an autoencoder structure and input an emotion embedding vector to achieve the generation of user-controlled expressions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.


Emotional Listener Portrait: Neural Listener Head Generation with Emotion

arXiv.org Artificial Intelligence

Listener head generation centers on generating non-verbal behaviors (e.g., smile) of a listener in reference to the information delivered by a speaker. A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which varies depending on the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotions in conversation. Benefiting from the "explicit" and "discrete" design, our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, our ELP exhibits significant improvements compared to previous methods.
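As a rough illustration of the "discrete" and "explicit" design described above, the sketch below draws several codeword indices from an emotion-conditioned categorical distribution and composes them into one motion vector. It is not the ELP model: the codebook, the per-emotion probability table, and the choice to compose codewords by summation are hypothetical stand-ins for quantities the real model learns.

```python
import random


def sample_motion(codebook, emotion_probs, emotion, num_codewords, rng):
    """Sample a motion vector as a composition of discrete codewords.

    codebook:      list of codeword vectors (all the same length)
    emotion_probs: dict mapping an emotion label to one weight per codeword
    emotion:       label selecting which distribution to sample from
    num_codewords: how many codewords make up one motion
    rng:           a random.Random instance, so sampling is reproducible
    """
    weights = emotion_probs[emotion]
    indices = rng.choices(range(len(codebook)), weights=weights, k=num_codewords)
    dim = len(codebook[0])
    # Assumed composition rule: element-wise sum of the chosen codewords.
    motion = [sum(codebook[i][d] for i in indices) for d in range(dim)]
    return indices, motion


# Toy codebook and distributions, for illustration only.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
emotion_probs = {
    "positive": [0.7, 0.3, 0.0],
    "negative": [0.0, 0.3, 0.7],
}
indices, motion = sample_motion(
    codebook, emotion_probs, "positive", 3, random.Random(0)
)
```

Fixing the emotion label corresponds to generating a response with a predetermined attitude, while repeated sampling with different seeds yields diverse responses under the same label.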


Emotion Selectable End-to-End Text-based Speech Editing

arXiv.org Artificial Intelligence

Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In the previous work, CampNet (context-aware mask prediction network) is proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding emotional effect to the edited speech during text-based speech editing to make the generated speech more expressive. To achieve this task, we propose Emo-CampNet (emotion CampNet), which can provide the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of generated speech by introducing additional emotion attributes based on the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered with by the emotional components in the original speech, a neutral content generator is proposed to remove the emotion from the original speech, which is optimized by the generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which can enable the model to edit the unseen speaker's speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech in the process of text-based speech editing and can edit unseen speakers' speech; 2) detailed ablation experiments further prove the effectiveness of emotional selectivity and data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/


MSA-GCN: Multiscale Adaptive Graph Convolution Network for Gait Emotion Recognition

arXiv.org Artificial Intelligence

Gait emotion recognition plays a crucial role in intelligent systems. Most of the existing methods recognize emotions by focusing on local actions over time. However, they ignore that the effective distances of different emotions in the time domain are different, and the local actions during walking are quite similar. Thus, emotions should be represented by global states instead of indirect local actions. To address these issues, a novel Multi Scale Adaptive Graph Convolution Network (MSA-GCN) is presented in this work through constructing dynamic temporal receptive fields and designing multiscale information aggregation to recognize emotions. In our model, an adaptive selective spatial-temporal graph convolution is designed to select the convolution kernel dynamically to obtain the soft spatio-temporal features of different emotions. Moreover, a Cross-Scale mapping Fusion Mechanism (CSFM) is designed to construct an adaptive adjacency matrix to enhance information interaction and reduce redundancy. Compared with previous state-of-the-art methods, the proposed method achieves the best performance on two public datasets, improving the mAP by 2%. We also conduct extensive ablation studies to show the effectiveness of different components in our method.


Analysis of impact of emotions on target speech extraction and speech separation

arXiv.org Artificial Intelligence

Recently, the performance of blind speech separation (BSS) and target speech extraction (TSE) has greatly progressed. Most works, however, focus on relatively well-controlled conditions using, e.g., read speech. The performance may degrade in more realistic situations. One of the factors causing such degradation may be intrinsic speaker variability, such as emotions, occurring commonly in realistic speech. In this paper, we investigate the influence of emotions on TSE and BSS. We create a new test dataset of emotional mixtures for the evaluation of TSE and BSS. This dataset combines LibriSpeech and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Through controlled experiments, we can analyze the impact of different emotions on the performance of BSS and TSE. We observe that BSS is relatively robust to emotions, while TSE, which requires identifying and extracting the speech of a target speaker, is much more sensitive to emotions. In comparative speaker verification experiments, we show that identifying the target speaker may be particularly challenging when dealing with emotional speech. Using our findings, we outline potential future directions that could improve the robustness of BSS and TSE systems toward emotional speech.


Using a New Interactive Interface Shows How Music Listeners Think Different Emotions Sound as Music - Neuroscience News

#artificialintelligence

Summary: A new computer interface allowed participants to convey their emotions through music by changing elements of the musical tune. New research conducted by experts from Durham University's Department of Music found that people are able to convey particular emotions through music by changing certain elements of the musical tune. The researchers created an interactive computer interface called EmoteControl which allows users to control six cues (tempo, pitch, articulation, dynamics, brightness, and mode) of a musical piece in real-time. The participants were asked to show how they think seven different emotions (sadness, calmness, joy, anger, fear, power, and surprise) should sound as music. They did this by changing the musical cues in EmoteControl, essentially allowing them to create their own variations of a range of music pieces that portrayed different emotions.


First ever interactive AUDIO map lets you HEAR emotions

Daily Mail - Science & tech

Scientists have found that involuntary sounds we make when we express shock, elation and fear reveal a lot more about what we feel than previously thought. An interactive audio map shows more than 2,000 sounds for a range of 24 different emotions like fear, surprise (positive and negative), embarrassment, elation and ecstasy. The results are demonstrated in vivid sound and colour on the map, which allows you to move the cursor along it and hear the varying sounds. Spontaneous sounds like 'woohoo' to convey excitement and 'argh' to show anger say a lot more about what we're feeling than previously understood, according to new research by Berkeley University. Scientists conducted a statistical analysis of responses to more than 2,000 nonverbal exclamations known as 'vocal bursts' to discover that there are thousands of different sounds for varying types of emotion.


How we read emotions on people's faces may say more about our perceptions of what others are feeling

Daily Mail - Science & tech

How we read emotions on other people's faces may say more about us than it does about them, a new study suggests. Experts found the cues that we use to judge the emotion behind a facial expression can vary highly from person to person. This is a dramatic departure from previous scientific understanding, which said the ability to identify six key emotions – anger, disgust, happiness, fear, sadness, and surprise – was universal across cultures and genetically hard-wired in humans. However, the latest results show everyone conceptualises emotions differently within their own minds, making it more difficult to read how other people are feeling. For example, some people might find it hard to differentiate between sadness and anger if they associate both these emotions with actions like crying, shouting, or slamming fists on the table.


Personalized "deep learning" equips robots for autism therapy

#artificialintelligence

Children with autism spectrum conditions often have trouble recognizing the emotional states of people around them -- distinguishing a happy face from a fearful face, for instance. To remedy this, some therapists use a kid-friendly robot to demonstrate those emotions and to engage the children in imitating the emotions and responding to them in appropriate ways. This type of therapy works best, however, if the robot can smoothly interpret the child's own behavior -- whether he or she is interested and excited or paying attention -- during the therapy. Researchers at the MIT Media Lab have now developed a type of personalized machine learning that helps robots estimate the engagement and interest of each child during these interactions, using data that are unique to that child. Armed with this personalized "deep learning" network, the robots' perception of the children's responses agreed with assessments by human experts, with a correlation score of 60 percent, the scientists report June 27 in Science Robotics.