deepfake audio
SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks
Uddin, Kutub, Khan, Awais, Farooq, Muhammad Umar, Malik, Khalid
Audio plays a crucial role in applications like speaker verification, voice-enabled smart devices, and audio conferencing. However, audio manipulations, such as deepfakes, pose significant risks by enabling the spread of misinformation. Our empirical analysis reveals that existing methods for detecting deepfake audio are often vulnerable to anti-forensic (AF) attacks, particularly those mounted using generative adversarial networks. In this article, we propose a novel collaborative learning method called SHIELD to defend against generative AF attacks. To expose AF signatures, we integrate an auxiliary generative model, called the defense (DF) generative model, which facilitates collaborative learning by combining input and output. Furthermore, we design a triplet model to capture correlations for real and AF-attacked audios with real-generated and attacked-generated audios using auxiliary generative models. The proposed SHIELD strengthens the defense against generative AF attacks and achieves robust performance across various generative models. The proposed AF attack significantly reduces the average detection accuracy from 95.49% to 59.77% for ASVspoof2019, from 99.44% to 38.45% for In-the-Wild, and from 98.41% to 51.18% for HalfTruth for three different generative models. The proposed SHIELD mechanism is robust against AF attacks and achieves an average accuracy of 98.13%, 98.58%, and 99.57% in match, and 98.78%, 98.62%, and 98.85% in mismatch settings for the ASVspoof2019, In-the-Wild, and HalfTruth datasets, respectively.
- North America > United States > Michigan > Genesee County > Flint (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
Reject Threshold Adaptation for Open-Set Model Attribution of Deepfake Audio
Yan, Xinrui, Yi, Jiangyan, Tao, Jianhua, Chen, Yujie, Gu, Hao, Li, Guanjun, Zhou, Junzuo, Ren, Yong, Xu, Tao
Open-environment-oriented open set model attribution of deepfake audio is an emerging research topic, aiming to identify the generation models of deepfake audio. Most previous work requires manually setting a rejection threshold for unknown classes to compare with predicted probabilities. However, models often overfit training instances and generate overly confident predictions. Moreover, thresholds that effectively distinguish unknown categories in the current dataset may not be suitable for identifying known and unknown categories in another data distribution. To address these issues, we propose a novel framework for open set model attribution of deepfake audio with rejection threshold adaptation (ReTA). Specifically, the reconstruction error learning module trains by combining the representation of system fingerprints with labels corresponding to either the target class or a randomly chosen other class label. This process generates matching and non-matching reconstructed samples, establishing the reconstruction error distributions for each class and laying the foundation for the reject threshold calculation module. The reject threshold calculation module utilizes Gaussian probability estimation to fit the distributions of matching and non-matching reconstruction errors. It then computes adaptive reject thresholds for all classes through probability minimization criteria. The experimental results demonstrate the effectiveness of ReTA in improving open set model attribution of deepfake audio.
- Research Report (0.84)
- Instructional Material > Course Syllabus & Notes (0.50)
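The threshold step described in the ReTA abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one class, equal priors for matching and non-matching samples, and that "probability minimization" means minimizing the total probability of a wrong accept/reject decision. The function names (`fit_gaussian`, `error_probability`, `adaptive_threshold`) and the grid search are hypothetical choices made for illustration.

```python
import math
import statistics


def fit_gaussian(errors):
    """Return (mean, std) of a 1-D sample of reconstruction errors."""
    return statistics.fmean(errors), statistics.stdev(errors)


def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def error_probability(t, match, nonmatch):
    """Probability of a wrong decision at threshold t (equal priors assumed).

    A sample is accepted as the class when its error is <= t, so the two
    error terms are matching errors above t (false reject) and
    non-matching errors at or below t (false accept).
    """
    false_reject = 1.0 - normal_cdf(t, *match)
    false_accept = normal_cdf(t, *nonmatch)
    return 0.5 * (false_reject + false_accept)


def adaptive_threshold(match_errors, nonmatch_errors, steps=1000):
    """Grid-search, between the two fitted means, the threshold with least error."""
    match = fit_gaussian(match_errors)
    nonmatch = fit_gaussian(nonmatch_errors)
    lo, hi = sorted((match[0], nonmatch[0]))
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda t: error_probability(t, match, nonmatch))
```

Running `adaptive_threshold` once per known class would give each class its own reject threshold, which is the adaptive behaviour the abstract contrasts with a single manually set value.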
Quantum-Trained Convolutional Neural Network for Deepfake Audio Detection
Lin, Chu-Hsuan Abraham, Liu, Chen-Yu, Chen, Samuel Yen-Chi, Chen, Kuan-Cheng
The rise of deepfake technologies has posed significant challenges to privacy, security, and information integrity, particularly in audio and multimedia content. This paper introduces a Quantum-Trained Convolutional Neural Network (QT-CNN) framework designed to enhance the detection of deepfake audio, leveraging the computational power of quantum machine learning (QML). The QT-CNN employs a hybrid quantum-classical approach, integrating Quantum Neural Networks (QNNs) with classical neural architectures to optimize training efficiency while reducing the number of trainable parameters. Our method incorporates a novel quantum-to-classical parameter mapping that effectively utilizes quantum states to enhance the expressive power of the model, achieving up to 70% parameter reduction compared to classical models without compromising accuracy. Data pre-processing involved extracting essential audio features, label encoding, feature scaling, and constructing sequential datasets for robust model evaluation. Experimental results demonstrate that the QT-CNN achieves comparable performance to traditional CNNs, maintaining high accuracy during training and testing phases across varying configurations of QNN blocks. The QT framework's ability to reduce computational overhead while maintaining performance underscores its potential for real-world applications in deepfake detection and other resource-constrained scenarios. This work highlights the practical benefits of integrating quantum computing into artificial intelligence, offering a scalable and efficient approach to advancing deepfake detection technologies.
- Europe > United Kingdom > England > Greater London > London (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap
Channing, Georgia, Sock, Juil, Clark, Ronald, Torr, Philip, de Witt, Christian Schroeder
The rapid proliferation of AI-manipulated or generated audio deepfakes poses serious challenges to media integrity and election security. Current AI-driven detection solutions lack explainability and underperform in real-world settings. In this paper, we introduce novel explainability methods for state-of-the-art transformer-based audio deepfake detectors and open-source a novel benchmark for real-world generalizability. By narrowing the explainability gap between transformer-based audio deepfake detectors and traditional methods, our results not only build trust with human experts, but also pave the way for unlocking the potential of citizen intelligence to overcome the scalability issue in audio deepfake detection.
- North America > United States (0.14)
- North America > Mexico (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Middle East > Jordan (0.04)
Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?
Xie, Yuankun, Xiong, Chenxu, Wang, Xiaopeng, Wang, Zhiyong, Lu, Yi, Qi, Xin, Fu, Ruibo, Liu, Yukun, Wen, Zhengqi, Tao, Jianhua, Li, Guanjun, Ye, Long
Currently, Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs. These ALMs have significantly lowered the barrier to creating deepfake audio, generating highly realistic and diverse types of deepfake audio, which pose severe threats to society. Consequently, effective audio deepfake detection technologies to detect ALM-based audio have become increasingly critical. This paper investigates the effectiveness of current countermeasures (CMs) against ALM-based audio. Specifically, we collect 12 types of the latest ALM-based deepfake audio and use the latest CMs to evaluate them. Our findings reveal that the latest codec-trained CM can effectively detect ALM-based audio, achieving 0% equal error rate under most ALM test conditions, which exceeded our expectations. This indicates promising directions for future research in ALM-based deepfake audio detection.
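For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an equal error rate (EER) can be approximated from detector scores. It assumes that a higher score means "more likely bona fide" and sweeps only the observed scores as thresholds; the function name `equal_error_rate` is hypothetical, and evaluation toolkits typically interpolate the ROC curve rather than use this coarse sweep.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: higher score means "more likely bona fide", and a
    sample is accepted when score >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

Under this sketch, a 0% EER means some threshold separates every bona fide sample from every spoofed sample in the test condition.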
Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio
Lu, Yi, Xie, Yuankun, Fu, Ruibo, Wen, Zhengqi, Tao, Jianhua, Wang, Zhiyong, Qi, Xin, Liu, Xuefei, Li, Yongwei, Liu, Yukun, Wang, Xiaopeng, Shi, Shuchen
With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to-end generation process, skipping the final step of vocoder processing. This poses a significant challenge for current audio deepfake detection (ADD) models based on vocoder artifacts. To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform. We propose the Codecfake dataset, which is generated by seven representative neural codec methods. Experimental results show that codec-trained ADD models exhibit a 41.406% reduction in average equal error rate compared to vocoder-trained ADD models on the Codecfake test set.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)
The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio
Xie, Yuankun, Lu, Yi, Fu, Ruibo, Wen, Zhengqi, Wang, Zhiyong, Tao, Jianhua, Qi, Xin, Wang, Xiaopeng, Liu, Yukun, Cheng, Haonan, Ye, Long, Sun, Yi
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio is currently widespread, highly deceptive, and versatile in type, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset that includes 2 languages, over 1M audio samples, and various test conditions, and focuses on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that ADD models trained with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.
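As background for the SAM reference above, here is a minimal sketch of one update step of the original sharpness-aware minimization (SAM), written on a toy quadratic loss. It is not the CSAM strategy, whose details the abstract does not give; the names `sam_step` and `toy_grad` and the values of `lr` and `rho` are assumptions chosen for illustration.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One step of vanilla sharpness-aware minimization (SAM).

    grad_fn(weights) must return the loss gradient as a list of floats.
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid dividing by zero
    # Ascent step: move to the approximate worst-case point within radius rho.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Descent step: apply the gradient taken at the perturbed point.
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]


def toy_grad(weights):
    """Gradient of the toy loss L(w) = sum(w_i ** 2)."""
    return [2.0 * w for w in weights]
```

The ascent step is computed from the gradient of whatever batch is supplied, so when batches mix data from several domains, that perturbation direction can be dominated by one of them; this is a plausible reading of the "domain ascent bias" the abstract mentions, offered here as an interpretation rather than the authors' definition.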
Deepfake audio of Biden alarms experts in lead-up to U.S. elections
No political deepfake has alarmed the world's disinformation experts more than the doctored audio message of U.S. President Joe Biden that began circulating over the weekend. In the phone message, a voice edited to sound like Biden urged voters in New Hampshire not to cast their ballots in Tuesday's Democratic primary. "Save your vote for the November election," the phone message went.
System Fingerprint Recognition for Deepfake Audio: An Initial Dataset and Investigation
Yan, Xinrui, Yi, Jiangyan, Wang, Chenglong, Tao, Jianhua, Zhou, Junzuo, Gu, Hao, Fu, Ruibo
The rapid progress of deep speech synthesis models has posed significant threats to society such as malicious content manipulation. Therefore, many studies have emerged to detect the so-called deepfake audio. However, existing works focus on the binary detection of real audio and fake audio. In real-world scenarios such as model copyright protection and digital evidence forensics, it is necessary to know which tool or model generated the deepfake audio in order to explain the decision. This motivates us to ask: Can we recognize the system fingerprints of deepfake audio? In this paper, we present the first deepfake audio dataset for system fingerprint recognition (SFR) and conduct an initial investigation. We collected the dataset from the speech synthesis systems of seven Chinese vendors that use the latest state-of-the-art deep learning technologies, including both clean and compressed sets. In addition, to facilitate the further development of system fingerprint recognition methods, we provide extensive benchmarks for comparison along with research findings. The dataset will be publicly available.
- Asia > China > Beijing > Beijing (0.05)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Zhejiang Province (0.04)
Deepfake audio has a tell
An office worker answers the phone and hears his boss, in a panic, tell him that she forgot to transfer money to the new contractor before she left for the day and needs him to do it. She gives him the wire transfer information, and with the money transferred, the crisis has been averted. The worker sits back in his chair, takes a deep breath, and watches as his boss walks in the door. The voice on the other end of the call was not his boss. The voice he heard was that of an audio deepfake, a machine-generated audio sample designed to sound exactly like his boss.