HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech

Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman

arXiv.org Artificial Intelligence 

Abstract--Current anti-spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. In this paper, we propose HuLA, a two-stage prosody-aware multi-task learning framework for spoof detection. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, using prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out-of-domain datasets, including expressive, emotional, and cross-lingual attacks. These results demonstrate that explicit prosodic supervision, combined with SSL embeddings, substantially improves robustness against advanced synthetic speech attacks.

Anti-spoofing aims to detect audio generated through replay attacks, speech synthesis, and voice conversion (VC) [1]. Recent progress in text-to-speech (TTS) [2]-[6] and VC systems [7]-[11] has amplified concerns about expressive synthetic speech, which can be misused to compromise biometric authentication or to impersonate speakers for spreading misinformation [12], [13]. One of the goals of speech generation is to produce speech that is natural and indistinguishable from human speech. Expressiveness and emotion are defining characteristics of human speech, and current synthesis systems still struggle to reproduce them faithfully. While this limitation is a weakness for synthesis, it represents a valuable opportunity for anti-spoofing: imperfect expressive-

A. Mahapatra is with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: amahapa2@jhu.edu). I. R. Ulgen is with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: iulgen1@jhu.edu).
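The abstract describes jointly optimizing spoof detection with auxiliary prosody tasks. A minimal sketch of such a multi-task objective is shown below; the loss weights (`lambda_spoof`, `lambda_prosody`) and the specific prosody tasks (e.g. pitch and energy prediction) are illustrative assumptions, not details taken from the paper.

```python
def multitask_loss(spoof_loss, prosody_losses,
                   lambda_spoof=1.0, lambda_prosody=0.5):
    """Combine a binary spoof-detection loss with auxiliary prosody
    task losses (e.g. pitch/energy prediction over SSL embeddings)
    into a single weighted training objective.

    Hypothetical weighting scheme for illustration only.
    """
    return lambda_spoof * spoof_loss + lambda_prosody * sum(prosody_losses)

# Example: spoof loss 0.7, two prosody task losses 0.2 and 0.1.
total = multitask_loss(0.7, [0.2, 0.1])  # 1.0*0.7 + 0.5*0.3 = 0.85
```

In practice the two terms would be backpropagated through a shared SSL encoder, so the prosody supervision shapes the representation used by the spoof-detection head.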