Collaborating Authors: Li, Xingfeng


TPCH: Tensor-interacted Projection and Cooperative Hashing for Multi-view Clustering

arXiv.org Artificial Intelligence

In recent years, anchor- and hash-based multi-view clustering methods have gained attention for their efficiency and simplicity in handling large-scale data. However, existing methods often overlook the interactions among multi-view data and the higher-order cooperative relationships that arise during projection, which degrades the quality of hash representations in low-dimensional spaces, hurts clustering performance, and increases sensitivity to noise. To address this issue, we propose a novel approach named Tensor-Interacted Projection and Cooperative Hashing for Multi-View Clustering (TPCH). TPCH stacks multiple projection matrices into a tensor, taking into account the synergies and communication that occur during the projection process. By capturing higher-order multi-view information through dual projection and Hamming space, TPCH employs an enhanced tensor nuclear norm to learn more compact and distinguishable hash representations, promoting communication within and between views. Experimental results demonstrate that this method significantly outperforms state-of-the-art methods in clustering on five large-scale multi-view datasets. Moreover, in terms of CPU time, TPCH achieves substantial acceleration over the most advanced current methods. The code is available at https://github.com/jankin-wang/TPCH.
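The central mechanism in this abstract, stacking the per-view projection matrices into a third-order tensor and regularizing it with a tensor nuclear norm, can be sketched in a few lines. The NumPy snippet below computes the standard t-SVD-based tensor nuclear norm; the paper's "enhanced" variant presumably modifies this, and all shapes and variable names here are illustrative assumptions, not taken from the TPCH code.

```python
import numpy as np

def tensor_nuclear_norm(T):
    """Standard t-SVD tensor nuclear norm of an m x n x V tensor:
    FFT along the third (view) mode, then average the nuclear norms
    of the frontal slices in the Fourier domain."""
    Tf = np.fft.fft(T, axis=2)
    return sum(np.linalg.svd(Tf[:, :, k], compute_uv=False).sum()
               for k in range(T.shape[2])) / T.shape[2]

# Stack hypothetical per-view projection matrices W_v (d x r) into a
# tensor, as the abstract describes, so one norm couples all views.
V, d, r = 3, 64, 16
projections = [np.random.randn(d, r) for _ in range(V)]
T = np.stack(projections, axis=2)  # shape (d, r, V)
print(tensor_nuclear_norm(T))
```

In a full objective, this term acts as a low-rank regularizer that encourages the views' projections to share structure, which is the "communication within and between views" the abstract refers to.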


MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction

arXiv.org Artificial Intelligence

The prevalent approach in speech emotion recognition (SER) integrates audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). A key issue with this approach is that ASR errors in the text modality can degrade SER performance. Previous studies have proposed an auxiliary ASR error detection task to adaptively assign weights to each word in the ASR hypotheses. However, this approach offers limited improvement because it does not address the coherence of the semantic information in the text. Additionally, the inherent heterogeneity of different modalities leads to distribution gaps between their representations, making fusion challenging. Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multi-modal fusion (MF) method to learn shared representations across modalities. We refer to our method as MF-AED-AEC. Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1%.
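As a rough illustration of how the three objectives described above can be trained jointly, here is a minimal PyTorch sketch: cross-attention fusion of audio and text features feeding an utterance-level SER head plus two token-level auxiliary heads (AED as binary error detection, AEC as vocabulary prediction). All dimensions, loss weights, and the fusion layer are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSER(nn.Module):
    """Hypothetical sketch: cross-modal fusion plus SER/AED/AEC heads."""
    def __init__(self, dim=256, n_emotions=4, vocab=1000):
        super().__init__()
        # Text tokens attend over audio frames to build shared representations.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ser_head = nn.Linear(dim, n_emotions)  # utterance-level emotion
        self.aed_head = nn.Linear(dim, 2)           # per-token: ASR error or not
        self.aec_head = nn.Linear(dim, vocab)       # per-token: corrected word

    def forward(self, audio_feats, text_feats):
        fused, _ = self.fuse(text_feats, audio_feats, audio_feats)
        return (self.ser_head(fused.mean(dim=1)),
                self.aed_head(fused),
                self.aec_head(fused))

model = MultiTaskSER()
audio = torch.randn(2, 50, 256)  # (batch, audio frames, dim), dummy features
text = torch.randn(2, 20, 256)   # (batch, text tokens, dim), dummy features
emo, aed, aec = model(audio, text)

# Joint loss with dummy targets; the 0.5 auxiliary weights are arbitrary.
loss = (F.cross_entropy(emo, torch.tensor([0, 1]))
        + 0.5 * F.cross_entropy(aed.reshape(-1, 2), torch.randint(2, (40,)))
        + 0.5 * F.cross_entropy(aec.reshape(-1, 1000), torch.randint(1000, (40,))))
loss.backward()
```

The design point this sketch captures is that the auxiliary AED/AEC losses shape the same fused representation the SER head consumes, which is how error-aware text information can improve emotion prediction.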


On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition

arXiv.org Artificial Intelligence

This paper proposes an efficient attempt at noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation for the downstream NSER task.

Typically, three common approaches are used to address the issue of noisy speech emotion recognition (NSER): the signal level, the feature level, and the model level, as outlined by Tiwari et al. [2]. For instance, Pandharipande et al. [3] used a voice activity detector to reduce noise at the signal level. Lachiri et al. [4] introduced a novel approach involving MFCC shifted-delta-cepstral coefficients at the feature level. Tiwari et al. [2] devised a generative noise model at the model level. The previously mentioned studies have proven effective in mitigating the impact of common noise sources like white Gaussian noise on speech-related tasks. However, in real-world settings, a distinct category of noise sounds, such as high-heeled shoes and door knocking, presents a substantial challenge.
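To make the idea concrete, here is a small sketch of extracting intermediate-layer ASR representations for a downstream emotion classifier. The paper's excerpt does not say which ASR model or layer is used; wav2vec 2.0 via Hugging Face Transformers and layer 6 are stand-in assumptions for illustration.

```python
import torch
from transformers import Wav2Vec2Model

# Illustrative stand-in ASR encoder; not necessarily the paper's model.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# Pool one intermediate layer's frame-level states into an utterance
# vector; this vector would then feed any downstream emotion classifier.
layer = 6  # assumed layer choice, not from the paper
emotion_feature = out.hidden_states[layer].mean(dim=1)  # shape (1, 768)
print(emotion_feature.shape)
```

Because the ASR encoder is trained to transcribe speech, its intermediate states tend to retain vocal content while discarding much of the non-vocal background, which is the noise-robustness property the excerpt argues for.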