Zhao, Ruoyu
Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition
Zhao, Ruoyu, Jiang, Xiantao, Yu, F. Richard, Leung, Victor C. M., Wang, Tao, Zhang, Shaohu
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction. Cross-Linguistic SER (CLSER) has been a challenging research problem due to significant variability in linguistic and acoustic features of different languages. In this study, we propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to gain from a source emotional speech dataset to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (e.g., English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German language) and 79.48% on EMOVO (Italian language). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.
Textualize Visual Prompt for Image Editing via Diffusion Bridge
Xu, Pengcheng, Fan, Qingnan, Kou, Fei, Qin, Shuai, Gu, Hong, Zhao, Ruoyu, Ling, Charles, Wang, Boyu
Visual prompt, a pair of before-and-after edited images, can convey indescribable imagery transformations and prosper in image editing. However, current visual prompt methods rely on a pretrained text-guided image-to-image generative model that requires a triplet of text, before, and after images for retraining over a text-to-image model. Such crafting triplets and retraining processes limit the scalability and generalization of editing. In this paper, we present a framework based on any single text-to-image model without reliance on the explicit image-to-image model thus enhancing the generalizability and scalability. Specifically, by leveraging the probability-flow ordinary equation, we construct a diffusion bridge to transfer the distribution between before-and-after images under the text guidance. By optimizing the text via the bridge, the framework adaptively textualizes the editing transformation conveyed by visual prompts into text embeddings without other models. Meanwhile, we introduce differential attention control during text optimization, which disentangles the text embedding from the invariance of the before-and-after images and makes it solely capture the delicate transformation and generalize to edit various images. Experiments on real images validate competitive results on the generalization, contextual coherence, and high fidelity for delicate editing with just one image pair as the visual prompt.
Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
Zheng, Shuhong, Bao, Zhipeng, Zhao, Ruoyu, Hebert, Martial, Wang, Yu-Xiong
Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of the created diverse and faithful data by leveraging a novel self-improving learning mechanism. Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.
SPikE-SSM: A Sparse, Precise, and Efficient Spiking State Space Model for Long Sequences Learning
Zhong, Yan, Zhao, Ruoyu, Wang, Chao, Guo, Qinghai, Zhang, Jianguo, Lu, Zhichao, Leng, Luziwei
Spiking neural networks (SNNs) provide a low-power, energy-efficient solution by utilizing the spike-based and sparse nature of biological systems. Since the advent of Transformers, SNNs have struggled to compete with artificial networks on long sequential tasks, until the recent emergence of state space models (SSMs), which offer superior computational efficiency and modeling capability. However, applying the highly capable SSMs to SNNs for long sequences learning poses three major challenges: The membrane potential is determined by the past spiking history of the neuron, leading to reduced efficiency for sequence modeling in parallel computing scenarios. Complex dynamics of biological spiking neurons are crucial for functionality but challenging to simulate and exploit effectively in large networks. It is arduous to maintain high sparsity while achieving high accuracy for spiking neurons without resorting to dense computing, as utilized in artificial neuron-based SSMs. To address these challenges, we propose a sparse, precise and efficient spiking SSM framework, termed SPikE-SSM. For, we propose a boundary compression strategy (PMBC) to accelerate the inference of the spiking neuron model, enabling parallel processing for long sequence learning. For, we propose a novel and concise neuron model incorporating reset-refractory mechanism to leverage the inherent temporal dimension for dynamic computing with biological interpretability. For, we hierarchically integrate the proposed neuron model to the original SSM block, and enhance the dynamics of SPikE-SSM by incorporating trainable thresholds and refractory magnitudes to balance accuracy and sparsity. Extensive experiments illustrate the effectiveness and robustness of SPikE-SSM on the long range arena benchmarks and large language dataset WikiText-103, showing the potential of dynamic spiking neurons in efficient long sequence learning. The code will be publicly available.