DarkStream: real-time speech anonymization with low latency

Waris Quamer, Ricardo Gutierrez-Osuna

arXiv.org Artificial Intelligence 

Abstract--We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, thus removing intermediate mel-spectrogram conversions. Evaluations show that our model achieves strong anonymization, yielding a speaker verification equal error rate (EER) close to 50% (near-chance performance) under the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (word error rate, WER, within 9%). By balancing low latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.

Voice recordings contain rich biometric information that reveals not only linguistic content but also personal attributes such as speaker identity, sex, and age, as well as paralinguistics (dialect/accent, emotions). Such sensitive information can be exploited by adversaries for speaker recognition and profiling, raising significant privacy concerns.