
Collaborating Authors: Zhao, Jianjun


Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling

arXiv.org Artificial Intelligence

Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the timbre of the source speech into that of an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable room for improvement in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems struggle to fully reproduce paralinguistic information in highly expressive speech, such as breathing, crying, and emotional nuances, limiting their practical applicability. To address these issues, we propose Takin-VC, a novel expressive zero-shot VC framework built on adaptive hybrid content encoding and memory-augmented, context-aware timbre modeling. Specifically, we introduce an innovative hybrid content encoder with an adaptive fusion module that implicitly integrates quantized features from the pre-trained WavLM and HybridFormer, extracting precise linguistic features while enriching paralinguistic elements. For timbre modeling, we propose memory-augmented and context-aware modules that generate high-quality target timbre features and fused representations, seamlessly aligning source content with target timbre. To enhance real-time performance, we adopt a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. Experimental results show that Takin-VC consistently surpasses state-of-the-art VC systems, achieving notable improvements in speech naturalness, expressiveness, and speaker similarity, while offering faster inference.
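
To make the conditional flow matching step concrete, here is a minimal training-step sketch in PyTorch. It is a generic optimal-transport CFM objective, not Takin-VC's actual implementation; the stand-in velocity network, the conditioning vector, and the sigma_min value are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal conditional flow matching (CFM) training step, sketched for
# mel-spectrogram reconstruction. All shapes and modules are illustrative.
mel_dim, cond_dim = 80, 256
v_theta = nn.Sequential(  # stand-in for the real velocity-field network
    nn.Linear(mel_dim + cond_dim + 1, 512), nn.SiLU(), nn.Linear(512, mel_dim)
)

def cfm_loss(x1, cond, sigma_min=1e-4):
    """x1: target mel frames (B, mel_dim); cond: content+timbre features."""
    x0 = torch.randn_like(x1)                    # noise sample
    t = torch.rand(x1.size(0), 1)                # uniform time in [0, 1]
    # Linear (optimal-transport) probability path between noise and data.
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u = x1 - (1 - sigma_min) * x0                # target velocity field
    v = v_theta(torch.cat([xt, cond, t], dim=-1))
    return ((v - u) ** 2).mean()

loss = cfm_loss(torch.randn(8, mel_dim), torch.randn(8, cond_dim))
loss.backward()
```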


CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching

arXiv.org Artificial Intelligence

Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into that of any previously unseen target speaker while preserving the original linguistic content. Despite notable progress, attaining speaker similarity and naturalness on par with ground-truth recordings continues to pose a great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that leverages content-aware timbre ensemble modeling and flow matching. Specifically, CTEFM-VC disentangles utterances into linguistic content and timbre representations, then uses a conditional flow matching model and a vocoder to reconstruct the mel-spectrogram and waveform. To enhance its timbre modeling capability and the naturalness of generated speech, we propose a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the joint utilization of linguistic and timbre features through a cross-attention module. Experiments show that our CTEFM-VC system surpasses state-of-the-art VC methods in speaker similarity and naturalness by at least 18.5% and 7.0%, respectively.
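
A minimal sketch of what such a cross-attention timbre ensemble could look like, assuming content frames act as queries over a small set of projected speaker verification embeddings; the dimensions, projections, and number of SV systems below are assumptions rather than CTEFM-VC's exact design.

```python
import torch
import torch.nn as nn

# Sketch of content-aware timbre ensemble fusion via cross-attention:
# content frames attend over a set of speaker-verification embeddings.
d_model, n_sv = 256, 3
proj = nn.ModuleList([nn.Linear(192, d_model) for _ in range(n_sv)])
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

content = torch.randn(2, 100, d_model)                 # (B, T, d) linguistic features
sv_embs = [torch.randn(2, 192) for _ in range(n_sv)]   # embeddings from n_sv SV models

# Stack projected speaker embeddings into a small "timbre memory".
timbre = torch.stack([p(e) for p, e in zip(proj, sv_embs)], dim=1)  # (B, n_sv, d)
fused, _ = attn(query=content, key=timbre, value=timbre)            # (B, T, d)
```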


A Coverage-Guided Testing Framework for Quantum Neural Networks

arXiv.org Artificial Intelligence

Quantum Neural Networks (QNNs) combine quantum computing and neural networks, leveraging quantum properties such as superposition and entanglement to improve machine learning models. These quantum characteristics enable QNNs to potentially outperform classical neural networks in tasks such as quantum chemistry simulations, optimization problems, and quantum-enhanced machine learning. However, they also introduce significant challenges in verifying the correctness and reliability of QNNs. To address this, we propose QCov, a set of test coverage criteria specifically designed for QNNs to systematically evaluate QNN state exploration during testing, focusing on superposition and entanglement. These criteria help detect quantum-specific defects and anomalies.

Quantum Neural Networks (QNNs) Cong et al. (2018) represent a significant advancement in computational technology, combining the principles of quantum mechanics with neural network mechanisms. By leveraging quantum properties such as superposition and entanglement, QNNs have the potential to solve complex problems more efficiently than classical neural networks, particularly in areas like image classification Li et al. (2022b); Shi et al. (2023); Henderson et al. (2019); Alam et al. (2021) and sequential data learning Bausch (2020); Yu et al. (2024). Despite this early success, similar to Deep Neural Networks (DNNs) LeCun et al. (1998a); He et al. (2015); Howard et al. (2017), QNNs have been shown to be vulnerable to adversarial Lu et al. (2019) and backdoor attacks Chu et al. (2023a;b), raising concerns about their security and robustness.

A recent work, QuanTest Shi et al. (2024), introduced the first adversarial testing framework for QNNs, using an entanglement-guided optimization algorithm to generate adversarial inputs and capture erroneous behaviors. However, QuanTest focuses primarily on individual inputs, lacking a comprehensive evaluation of overall test adequacy for QNNs. Additionally, due to the complexity of the Hilbert space, which grows exponentially with the number of qubits, it is impractical to manually test QNNs thoroughly. This highlights the urgent need for a comprehensive testing framework to assess the test adequacy of QNNs. To ensure system quality, numerous testing techniques have been developed for deep learning (DL) systems Zhang et al. (2020); Wang et al. (2024) and traditional quantum software Wang et al. (2021a;b); Fortunato et al. (2022a); Xia et al. (2024) from various perspectives.
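
For intuition, a generic state-coverage computation in the spirit of such criteria might bucket the measurement probability of each basis state and track which (state, bucket) cells a test suite exercises. The sketch below is a hypothetical illustration, not QCov's actual definitions; the 3-qubit setup and 10-bucket scheme are assumed.

```python
import numpy as np

# Generic state-coverage sketch: bucket the Born-rule probability of each
# computational basis state and count which (state, bucket) cells the
# test suite has exercised so far.
n_qubits, n_buckets = 3, 10
dim = 2 ** n_qubits
covered = np.zeros((dim, n_buckets), dtype=bool)

def update_coverage(state_vector):
    probs = np.abs(state_vector) ** 2            # measurement probabilities
    buckets = np.minimum((probs * n_buckets).astype(int), n_buckets - 1)
    covered[np.arange(dim), buckets] = True

# Example: a uniform superposition reached by the QNN under one test input.
psi = np.ones(dim, dtype=complex) / np.sqrt(dim)
update_coverage(psi)
print("coverage:", covered.mean())               # fraction of cells hit
```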


GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

arXiv.org Artificial Intelligence

The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, which inadequately capture the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to mitigate this gap. Specifically, GMP-TL first uses the pre-trained HuBERT, applying multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage the frame-level GMPs and utterance-level emotion labels, a two-stage fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that GMP-TL attains a WAR of 80.0% and a UAR of 82.0%, outperforming state-of-the-art unimodal SER methods while yielding results comparable to multimodal SER approaches.
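
A rough sketch of the multi-scale pseudo-labelling idea: cluster frame-level features from a pre-trained encoder with several cluster counts, yielding one pseudo-label stream per scale. The feature shape and k values below are assumptions; GMP-TL additionally augments these labels with gender information, which is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Multi-scale pseudo-labels: one k-means clustering per scale over
# frame-level features (e.g., from a pre-trained HuBERT layer).
frames = np.random.randn(2000, 768)              # (n_frames, feat_dim), stand-in features
scales = [50, 100, 200]                          # assumed cluster counts per scale

pseudo_labels = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(frames)
    for k in scales
}
for k, labels in pseudo_labels.items():
    print(k, labels[:10])                        # first frames' labels at each scale
```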


PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset

arXiv.org Artificial Intelligence

With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern. Deepfake detection has consequently emerged as a crucial strategy for countering these growing threats. However, as a key resource for training and validating deepfake detectors, most existing deepfake datasets focus primarily on the visual modality, and the few multimodal ones employ outdated techniques with audio content limited to a single language, failing to represent the cutting-edge advancements and globalization trends in current deepfake technologies. To address this gap, we propose PolyGlotFake, a novel multilingual and multimodal deepfake dataset. It includes content in seven languages, created with a variety of cutting-edge and popular text-to-speech, voice cloning, and lip-sync technologies. We conduct comprehensive experiments using state-of-the-art detection methods on the PolyGlotFake dataset, demonstrating its significant challenges and practical value in advancing research on multimodal deepfake detection.


DPGAN: A Dual-Path Generative Adversarial Network for Missing Data Imputation in Graphs

arXiv.org Artificial Intelligence

Missing data imputation poses a paramount challenge when dealing with graph data. Prior works are typically based on feature propagation or graph autoencoders to address this issue. However, these methods usually encounter over-smoothing when handling missing data, as their graph neural network (GNN) modules are not explicitly designed for it. This paper proposes a novel framework, the Dual-Path Generative Adversarial Network (DPGAN), that simultaneously handles missing data and avoids the over-smoothing problem. The crux of our work is that it admits both global and local representations of the input graph signal, capturing long-range dependencies. This is realized through our proposed generator, which consists of two key components, MLPUNet++ and GraphUNet++. The generator is trained against a designated discriminator via an adversarial process. In particular, to avoid assessing the entire graph as is done in the literature, our discriminator focuses on local subgraph fidelity, thereby boosting the quality of local imputation. The subgraph size is adjustable, allowing control over the intensity of adversarial regularization. Comprehensive experiments across various benchmark datasets substantiate that DPGAN consistently rivals, if not outperforms, existing state-of-the-art imputation algorithms. The code is available at https://github.com/momoxia/DPGAN.
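
A minimal sketch of the local adversarial regularization idea: rather than scoring the whole graph, a critic scores imputed node features restricted to a k-hop subgraph, with k controlling the subgraph size. The random graph, feature sizes, and MLP critic are illustrative assumptions; the actual generator uses the MLPUNet++ and GraphUNet++ paths described above.

```python
import torch
import torch.nn as nn

def k_hop_nodes(adj, seed, k):
    """adj: (N, N) 0/1 tensor; returns indices of nodes within k hops of seed."""
    frontier = torch.zeros(adj.size(0), dtype=torch.bool)
    frontier[seed] = True
    for _ in range(k):
        frontier |= (adj[frontier].sum(0) > 0)   # expand one hop per iteration
    return frontier.nonzero(as_tuple=True)[0]

N, F = 50, 16
adj = (torch.rand(N, N) < 0.05).float()
adj = ((adj + adj.t()) > 0).float()              # symmetrize the random graph
x_imputed = torch.randn(N, F)                    # stand-in generator output

critic = nn.Sequential(nn.Linear(F, 64), nn.ReLU(), nn.Linear(64, 1))
nodes = k_hop_nodes(adj, seed=0, k=2)            # subgraph size adjustable via k
score = critic(x_imputed[nodes]).mean()          # local-fidelity critic score
```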


PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders

arXiv.org Artificial Intelligence

Neural speech codecs have recently gained widespread attention in generative speech modeling domains such as voice conversion and text-to-speech synthesis. However, ensuring high-fidelity audio reconstruction under low bitrates remains an open and challenging problem. In this paper, we propose PromptCodec, a novel end-to-end neural speech codec using feature-aware prompt encoders based on disentangled representation learning. By incorporating prompt encoders to capture representations of additional input prompts, PromptCodec can distribute the speech information to be processed and enhance its representation capability. Moreover, a simple yet effective adaptive feature-weighted fusion approach is introduced to integrate the features of the different encoders. Meanwhile, we propose a novel disentangled representation learning strategy based on the structural similarity index measure (SSIM) to optimize PromptCodec's encoders and ensure their efficiency, further improving performance. Experiments on LibriTTS demonstrate that PromptCodec consistently outperforms state-of-the-art neural speech codecs across all bitrate conditions, with especially strong gains at low bitrates.
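
The adaptive feature-weighted fusion can be illustrated with a short sketch: learnable logits, normalized by a softmax, blend the outputs of several encoders. The encoder count and feature dimensions are assumptions, not PromptCodec's exact configuration.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Blend encoder outputs with learned, softmax-normalized scalar weights."""
    def __init__(self, n_encoders):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_encoders))  # learned fusion logits

    def forward(self, feats):                    # feats: list of (B, T, D) tensors
        weights = torch.softmax(self.w, dim=0)
        return sum(w * f for w, f in zip(weights, feats))

fusion = WeightedFusion(n_encoders=2)
main_feat, prompt_feat = torch.randn(4, 50, 128), torch.randn(4, 50, 128)
fused = fusion([main_feat, prompt_feat])         # (4, 50, 128)
```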


GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

arXiv.org Artificial Intelligence

Contrastive cross-modality pretraining has recently exhibited impressive success in diverse fields, yet there is limited research on its merits for speech emotion recognition (SER). In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER using pre-trained text and audio encoders. Second, given the significance of gender information in SER, two novel variants, a multi-task learning based GEmo-CLAP (ML-GEmo-CLAP) and a soft label based GEmo-CLAP (SL-GEmo-CLAP), are further proposed to incorporate the gender information of speech signals, forming more reasonable training objectives. Experiments on IEMOCAP indicate that both proposed GEmo-CLAP models consistently outperform Emo-CLAP across different pre-trained models. Remarkably, the proposed WavLM-based SL-GEmo-CLAP obtains the best WAR of 83.16%, surpassing state-of-the-art SER methods.
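
A hedged sketch of what a soft-label CLAP objective could look like: similarity logits between audio and text embeddings are trained against soft targets mixing emotion and gender agreement rather than a strict one-hot diagonal. The mixing weight alpha, temperature, and batch construction are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def sl_clap_loss(a_emb, t_emb, emo, gender, tau=0.07, alpha=0.9):
    """Contrastive loss with soft targets built from label agreement."""
    a, t = F.normalize(a_emb, dim=-1), F.normalize(t_emb, dim=-1)
    logits = a @ t.t() / tau                              # (B, B) similarity logits
    emo_agree = (emo[:, None] == emo[None, :]).float()    # same-emotion pairs
    gen_agree = (gender[:, None] == gender[None, :]).float()
    target = alpha * emo_agree + (1 - alpha) * gen_agree  # assumed mixing scheme
    target = target / target.sum(dim=-1, keepdim=True)    # row-normalized soft labels
    return F.cross_entropy(logits, target)                # accepts soft targets

loss = sl_clap_loss(torch.randn(8, 256), torch.randn(8, 256),
                    emo=torch.randint(0, 4, (8,)), gender=torch.randint(0, 2, (8,)))
```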


MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition

arXiv.org Artificial Intelligence

Despite significant progress, speech emotion recognition (SER) remains challenging due to the inherent complexity and ambiguity of the emotion attribute, particularly in the wild. Whereas current studies primarily focus on recognition and generalization abilities, this work pioneers an investigation into the reliability of SER methods and explores modeling speech emotion based on data distributions across various speech attributes. Specifically, we first design a novel CNN-based SER model that adopts an additive margin softmax loss. Second, we propose MSAC, a novel multiple speech attribute control method that explicitly controls speech attributes, making the model less affected by emotion-agnostic features and enabling it to extract fine-grained emotion-related representations. Third, we make a first attempt to examine the reliability of the proposed unified SER workflow using an out-of-distribution detection method. Experiments in both single-corpus and cross-corpus SER scenarios show that our unified SER workflow consistently outperforms the baseline in all aspects. Remarkably, in single-corpus SER, it achieves superior recognition results with a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
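
Since the abstract names an additive margin softmax loss, here is a minimal sketch of the standard AM-Softmax formulation: cosine logits have a margin m subtracted on the target class and are rescaled by s before cross-entropy. The s and m values are common defaults, not necessarily the paper's settings.

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(features, weight, labels, s=30.0, m=0.35):
    """features: (B, D); weight: (C, D) class prototypes; labels: (B,)."""
    cos = F.normalize(features, dim=-1) @ F.normalize(weight, dim=-1).t()
    onehot = F.one_hot(labels, num_classes=weight.size(0)).float()
    logits = s * (cos - m * onehot)              # margin only on the true class
    return F.cross_entropy(logits, labels)

loss = am_softmax_loss(torch.randn(8, 128), torch.randn(4, 128),
                       torch.randint(0, 4, (8,)))
```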


On the Effectiveness of Hybrid Pooling in Mixup-Based Graph Learning for Language Processing

arXiv.org Artificial Intelligence

Graph neural network (GNN)-based graph learning has become popular in natural language and programming language processing, particularly for text and source code classification. Typically, GNNs are constructed from alternating layers that learn transformations of graph node features and graph pooling layers that use pooling operators (e.g., max-pooling) to reduce the number of nodes while preserving the graph's semantic information. Recently, Manifold-Mixup, a data augmentation technique that produces synthetic graph data by linearly mixing pairs of graph representations and their labels, has been widely adopted to enhance GNNs on graph learning tasks. However, the performance of Manifold-Mixup can be highly affected by the choice of graph pooling operator, and few studies have been dedicated to uncovering this effect. To bridge this gap, we take an early step toward exploring how graph pooling operators affect the performance of Mixup-based graph learning. To that end, we conduct a comprehensive empirical study applying Manifold-Mixup to a formal characterization of graph pooling based on 11 graph pooling operations (9 hybrid and 2 non-hybrid pooling operators). Experimental results on both natural language datasets (Gossipcop, Politifact) and programming language datasets (JAVA250, Python800) demonstrate that hybrid pooling operators are more effective for Manifold-Mixup than standard max-pooling and the state-of-the-art graph multiset transformer (GMT) pooling, producing more accurate and robust models.
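
A minimal sketch of Manifold-Mixup applied after graph pooling, assuming pooled graph embeddings and one-hot labels are mixed with a Beta-sampled coefficient; the shapes and alpha value are illustrative.

```python
import torch

def graph_mixup(h_a, h_b, y_a, y_b, alpha=1.0):
    """Linearly mix two batches of pooled graph embeddings and their labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b

h_a, h_b = torch.randn(16, 128), torch.randn(16, 128)   # pooled graph embeddings
y_a = torch.eye(4)[torch.randint(0, 4, (16,))]          # one-hot graph labels
y_b = torch.eye(4)[torch.randint(0, 4, (16,))]
h_mix, y_mix = graph_mixup(h_a, h_b, y_a, y_b)          # synthetic training pairs
```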