Yi, Jiangyan
Region-Based Optimization in Continual Learning for Audio Deepfake Detection
Chen, Yujie, Yi, Jiangyan, Fan, Cunhang, Tao, Jianhua, Ren, Yong, Zeng, Siding, Zhang, Chu Yuan, Yan, Xinrui, Gu, Hao, Xue, Jun, Wang, Chenglong, Lv, Zhao, Zhang, Xiaohui
Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based Optimization (RegO) for audio deepfake detection. Specifically, we use the Fisher information matrix to measure important neuron regions for real and fake audio detection, dividing them into four regions. First, we directly fine-tune the less important regions to quickly adapt to new tasks. Next, we apply gradient optimization in parallel for regions important only to real audio detection, and in orthogonal directions for regions important only to fake audio detection. For regions that are important to both, we use sample proportion-based adaptive gradient optimization. This region-adaptive optimization ensures an appropriate trade-off between memory stability and learning plasticity. Additionally, to address the increase of redundant neurons from old tasks, we further introduce the Ebbinghaus forgetting mechanism to release them, thereby promoting the capability of the model to learn more generalized discriminative features. Experimental results show our method achieves a 21.3% improvement in EER over the state-of-the-art continual learning approach RWM for audio deepfake detection. Moreover, the effectiveness of RegO extends beyond the audio deepfake detection domain, showing potential significance in other tasks, such as image recognition. The code is available at https://github.com/cyjie429/RegO
Reject Threshold Adaptation for Open-Set Model Attribution of Deepfake Audio
Yan, Xinrui, Yi, Jiangyan, Tao, Jianhua, Chen, Yujie, Gu, Hao, Li, Guanjun, Zhou, Junzuo, Ren, Yong, Xu, Tao
Open environment oriented open set model attribution of deepfake audio is an emerging research topic, aiming to identify the generation models of deepfake audio. Most previous work requires manually setting a rejection threshold for unknown classes to compare with predicted probabilities. However, models often overfit training instances and generate overly confident predictions. Moreover, thresholds that effectively distinguish unknown categories in the current dataset may not be suitable for identifying known and unknown categories in another data distribution. To address the issues, we propose a novel framework for open set model attribution of deepfake audio with rejection threshold adaptation (ReTA). Specifically, the reconstruction error learning module trains by combining the representation of system fingerprints with labels corresponding to either the target class or a randomly chosen other class label. This process generates matching and non-matching reconstructed samples, establishing the reconstruction error distributions for each class and laying the foundation for the reject threshold calculation module. The reject threshold calculation module utilizes gaussian probability estimation to fit the distributions of matching and non-matching reconstruction errors. It then computes adaptive reject thresholds for all classes through probability minimization criteria. The experimental results demonstrate the effectiveness of ReTA in improving the open set model attributes of deepfake audio.
Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology
Sharif, Muhammad, Yi, Jiangyan, Shoaib, Muhammad
The language called Balti belongs to the Sino-Tibetan, specifically the Tibeto-Burman language family. It is understood with variations, across populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan, influenced by local cultures and producing various dialects. Considering the diverse cultural, socio-political, religious, and geographical impacts, it is important to step forward unifying the dialects, the basis of common root, lexica, and phonological perspectives, is vital. In the era of globalization and the increasingly frequent developments in AI technology, understanding the diversity and the efforts of dialect unification is important to understanding commonalities and shortening the gaps impacted by unavoidable circumstances. This article analyzes and examines how artificial intelligence AI in the essence of Large Language Models LLMs, can assist in analyzing, documenting, and standardizing the endangered Balti Language, based on the efforts made in different dialects so far.
From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language
Sharif, Muhammad, Abbas, Zeeshan, Yi, Jiangyan, Liu, Chenglin
Automatic Speech Recognition (ASR) technology has witnessed significant advancements in recent years, revolutionizing human-computer interactions. While major languages have benefited from these developments, lesser-resourced languages like Urdu face unique challenges. This paper provides an extensive exploration of the dynamic landscape of ASR research, focusing particularly on the resource-constrained Urdu language, which is widely spoken across South Asian nations. It outlines current research trends, technological advancements, and potential directions for future studies in Urdu ASR, aiming to pave the way for forthcoming researchers interested in this domain. By leveraging contemporary technologies, analyzing existing datasets, and evaluating effective algorithms and tools, the paper seeks to shed light on the unique challenges and opportunities associated with Urdu language processing and its integration into the broader field of speech research.
An Unsupervised Domain Adaptation Method for Locating Manipulated Region in partially fake Audio
Zeng, Siding, Yi, Jiangyan, Tao, Jianhua, Chen, Yujie, Liang, Shan, Ren, Yong, Zhang, Xiaohui
When the task of locating manipulation regions in partially-fake audio (PFA) involves cross-domain datasets, the performance of deep learning models drops significantly due to the shift between the source and target domains. To address this issue, existing approaches often employ data augmentation before training. However, they overlook the characteristics in target domain that are absent in source domain. Inspired by the mixture-of-experts model, we propose an unsupervised method named Samples mining with Diversity and Entropy (SDE). Our method first learns from a collection of diverse experts that achieve great performance from different perspectives in the source domain, but with ambiguity on target samples. We leverage these diverse experts to select the most informative samples by calculating their entropy. Furthermore, we introduced a label generation method tailored for these selected samples that are incorporated in the training process in source domain integrating the target domain information. We applied our method to a cross-domain partially fake audio detection dataset, ADD2023Track2. By introducing 10% of unknown samples from the target domain, we achieved an F1 score of 43.84%, which represents a relative increase of 77.2% compared to the second-best method.
MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition
Lian, Zheng, Sun, Haiyang, Sun, Licai, Wen, Zhuofan, Zhang, Siyuan, Chen, Shun, Gu, Hao, Zhao, Jinming, Ma, Ziyang, Chen, Xie, Yi, Jiangyan, Liu, Rui, Xu, Kele, Liu, Bin, Cambria, Erik, Zhao, Guoying, Schuller, Björn W., Tao, Jianhua
Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate annotations), current systems are hard to meet the demands of practical applications. Therefore, we organize a series of challenges around emotion recognition to further promote the development of this area. Last year, we launched MER2023, focusing on three topics: multi-label learning, noise robustness, and semi-supervised learning. This year, we continue to organize MER2024. In addition to expanding the dataset size, we introduce a new track around open-vocabulary emotion recognition. The main consideration for this track is that existing datasets often fix the label space and use majority voting to enhance annotator consistency, but this process may limit the model's ability to describe subtle emotions. In this track, we encourage participants to generate any number of labels in any category, aiming to describe the emotional state as accurately as possible. Our baseline is based on MERTools and the code is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection
Zhang, Xiaohui, Yi, Jiangyan, Wang, Chenglong, Zhang, Chuyuan, Zeng, Siding, Tao, Jianhua
The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition.
System Fingerprint Recognition for Deepfake Audio: An Initial Dataset and Investigation
Yan, Xinrui, Yi, Jiangyan, Wang, Chenglong, Tao, Jianhua, Zhou, Junzuo, Gu, Hao, Fu, Ruibo
The rapid progress of deep speech synthesis models has posed significant threats to society such as malicious content manipulation. Therefore, many studies have emerged to detect the so-called deepfake audio. However, existing works focus on the binary detection of real audio and fake audio. In real-world scenarios such as model copyright protection and digital evidence forensics, it is needed to know what tool or model generated the deepfake audio to explain the decision. This motivates us to ask: Can we recognize the system fingerprints of deepfake audio? In this paper, we present the first deepfake audio dataset for system fingerprint recognition (SFR) and conduct an initial investigation. We collected the dataset from the speech synthesis systems of seven Chinese vendors that use the latest state-of-the-art deep learning technologies, including both clean and compressed sets. In addition, to facilitate the further development of system fingerprint recognition methods, we provide extensive benchmarks that can be compared and research findings. The dataset will be publicly available. .
MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning
Lian, Zheng, Sun, Haiyang, Sun, Licai, Chen, Kang, Xu, Mingyu, Wang, Kexin, Xu, Ke, He, Yu, Li, Ying, Zhao, Jinming, Liu, Ye, Liu, Bin, Yi, Jiangyan, Wang, Meng, Cambria, Erik, Zhao, Guoying, Schuller, Björn W., Tao, Jianhua
The first Multimodal Emotion Recognition Challenge (MER 2023) was successfully held at ACM Multimedia. The challenge focuses on system robustness and consists of three distinct tracks: (1) MER-MULTI, where participants are required to recognize both discrete and dimensional emotions; (2) MER-NOISE, in which noise is added to test videos for modality robustness evaluation; (3) MER-SEMI, which provides a large amount of unlabeled samples for semi-supervised learning. In this paper, we introduce the motivation behind this challenge, describe the benchmark dataset, and provide some statistics about participants. To continue using this dataset after MER 2023, please sign a new End User License Agreement and send it to our official email address merchallenge.contact@gmail.com. We believe this high-quality dataset can become a new benchmark in multimodal emotion recognition, especially for the Chinese research community.
DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection
Fan, Cunhang, Zhang, Hongyu, Huang, Wei, Xue, Jun, Tao, Jianhua, Yi, Jiangyan, Lv, Zhao, Wu, Xiaopei
Auditory Attention Detection (AAD) aims to detect target speaker from brain signals in a multi-speaker environment. Although EEG-based AAD methods have shown promising results in recent years, current approaches primarily rely on traditional convolutional neural network designed for processing Euclidean data like images. This makes it challenging to handle EEG signals, which possess non-Euclidean characteristics. In order to address this problem, this paper proposes a dynamical graph self-distillation (DGSD) approach for AAD, which does not require speech stimuli as input. Specifically, to effectively represent the non-Euclidean properties of EEG signals, dynamical graph convolutional networks are applied to represent the graph structure of EEG signals, which can also extract crucial features related to auditory spatial attention in EEG signals. In addition, to further improve AAD detection performance, self-distillation, consisting of feature distillation and hierarchical distillation strategies at each layer, is integrated. These strategies leverage features and classification results from the deepest network layers to guide the learning of shallow layers. Our experiments are conducted on two publicly available datasets, KUL and DTU. Under a 1-second time window, we achieve results of 90.0\% and 79.6\% accuracy on KUL and DTU, respectively. We compare our DGSD method with competitive baselines, and the experimental results indicate that the detection performance of our proposed DGSD method is not only superior to the best reproducible baseline but also significantly reduces the number of trainable parameters by approximately 100 times.