AITopics | iemocap

Collaborating Authors

iemocap

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

PaSE: Prototype-aligned Calibration and Shapley-based Equilibrium for Multimodal Sentiment Analysis

He, Kang, Chen, Boyu, Ding, Yuzhe, Li, Fei, Teng, Chong, Ji, Donghong

arXiv.org Artificial IntelligenceNov-26-2025

Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by integrating textual, acoustic, and visual signals. Although multimodal fusion is designed to leverage cross-modal complementarity, real-world scenarios often exhibit modality competition: dominant modalities tend to overshadow weaker ones, leading to suboptimal performance. In this paper, we propose PaSE, a novel Prototype-aligned Calibration and Shapley-optimized Equilibrium framework, which enhances collaboration while explicitly mitigating modality competition. PaSE first applies Prototype-guided Calibration Learning (PCL) to refine unimodal representations and align them through an Entropic Optimal Transport mechanism that ensures semantic consistency. To further stabilize optimization, we introduce a Dual-Phase Optimization strategy. A prototype-gated fusion module is first used to extract shared representations, followed by Shapley-based Gradient Modulation (SGM), which adaptively adjusts gradients according to the contribution of each modality. Extensive experiments on IEMOCAP, MOSI, and MOSEI confirm that PaSE achieves the superior performance and effectively alleviates modality competition.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.17585

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.49)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.35)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.34)

Add feedback

Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion

Wang, Zanxu, Beigi, Homayoon

arXiv.org Artificial IntelligenceNov-20-2025

This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and Facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.

artificial intelligence, dataset, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.13140/RG.2.2.33632.55045

2511.14969

Country:

North America (0.28)
Asia > India (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates

Zhao, Binyu, Zhang, Wei, Zou, Zhaonian

arXiv.org Artificial IntelligenceNov-11-2025

Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The final published version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.patcog.2025.112591

2510.10534

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations

Liu, Yu, Shi, Hanlei, Li, Haoxun, Sun, Yuqing, Ding, Yuxuan, Gong, Linlin, Qu, Leyuan, Li, Taihao

arXiv.org Artificial IntelligenceOct-13-2025

Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion, and aligns modalities using a routed Mixture-of-Aligners; a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.08606

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.74)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

EmoHRNet: High-Resolution Neural Network Based Speech Emotion Recognition

Muppidi, Akshay, Radfar, Martin

arXiv.org Artificial IntelligenceOct-8-2025

Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the initial to the final layers. By transforming audio samples into spectrograms, EmoHRNet leverages the HRNet architecture to extract high-level features. EmoHRNet's unique architecture maintains high-resolution representations throughout, capturing both granular and overarching emotional cues from speech signals. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO. Thus, we show that EmoHRNet sets a new benchmark in the SER domain.

artificial intelligence, emotion recognition, machine learning, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP48485

2510.06072

Country:

North America > United States (0.14)
Europe > Iceland (0.14)

Genre: Research Report > Promising Solution (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Add feedback

Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Tan, Weiting, Lian, Jiachen, Inaguma, Hirofumi, Tomasello, Paden, Koehn, Philipp, Ma, Xutai

arXiv.org Artificial IntelligenceAug-29-2025

We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2508.16188

Country: Asia (0.46)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture

Mandal, Nischal, Li, Yang

arXiv.org Artificial IntelligenceMay-9-2025

Multimodal sentiment analysis, a pivotal task in affective computing, seeks to understand human emotions by integrating cues from language, audio, and visual signals. While many recent approaches leverage complex attention mechanisms and hierarchical architectures, we propose a lightweight, yet effective fusion-based deep learning model tailored for utterance-level emotion classification. Using the benchmark IEMOCAP dataset, which includes aligned text, audio-derived numeric features, and visual descriptors, we design a modality-specific encoder using fully connected layers followed by dropout regularization. The modality-specific representations are then fused using simple concatenation and passed through a dense fusion layer to capture cross-modal interactions. This streamlined architecture avoids computational overhead while preserving performance, achieving a classification accuracy of 92% across six emotion categories. Our approach demonstrates that with careful feature engineering and modular design, simpler fusion strategies can outperform or match more complex models, particularly in resource-constrained environments.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.04642

Country: Europe > United Kingdom > England > Greater London > London (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(2 more...)

Add feedback

OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs

Murzaku, John, Rambow, Owen

arXiv.org Artificial IntelligenceMar-28-2025

The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks: IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text only and text and audio. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare our acoustic prompting to minimal prompting and full chain-of-thought prompting techniques. We perform a context window analysis on IEMOCAP and MELD, and find that using context helps, especially on IEMOCAP. We conclude with an error analysis on the generated acoustic reasoning outputs from the omni-LLMs.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2503.2148

Country:

Europe > Portugal > Lisbon > Lisbon (0.04)
North America > United States > New York > Suffolk County > Stony Brook (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(3 more...)

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition

Zhao, Ruoyu, Jiang, Xiantao, Yu, F. Richard, Leung, Victor C. M., Wang, Tao, Zhang, Shaohu

arXiv.org Artificial IntelligenceJan-6-2025

Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction. Cross-Linguistic SER (CLSER) has been a challenging research problem due to significant variability in linguistic and acoustic features of different languages. In this study, we propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to gain from a source emotional speech dataset to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (e.g., English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German language) and 79.48% on EMOVO (Italian language). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.

data mining, emotion recognition, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.10408

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.14)
North America > United States > North Carolina (0.04)
Asia > Singapore (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(5 more...)

Add feedback

TED: Turn Emphasis with Dialogue Feature Attention for Emotion Recognition in Conversation

Ono, Junya, Wakaki, Hiromi

arXiv.org Artificial IntelligenceJan-2-2025

Emotion recognition in conversation (ERC) has been attracting attention by methods for modeling multi-turn contexts. The multi-turn input to a pretraining model implicitly assumes that the current turn and other turns are distinguished during the training process by inserting special tokens into the input sequence. This paper proposes a priority-based attention method to distinguish each turn explicitly by adding dialogue features into the attention mechanism, called Turn Emphasis with Dialogue (TED). It has a priority for each turn according to turn position and speaker information as dialogue features. It takes multi-head self-attention between turn-based vectors for multi-turn input and adjusts attention scores with the dialogue features. We evaluate TED on four typical benchmarks. The experimental results demonstrate that TED has high overall performance in all datasets and achieves state-of-the-art performance on IEMOCAP with numerous turns.

computational linguistic, current turn, speaker information, (14 more...)

arXiv.org Artificial Intelligence

2501.01123

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > China (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.62)

Add feedback