ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams
Freddie Grabovski, Gilad Gressel, Yisroel Mirsky
–arXiv.org Artificial Intelligence
Large Language Models (LLMs), combined with Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), are increasingly used to automate voice phishing (vishing) scams. These systems are scalable and convincing, posing a significant security threat. We identify the ASR transcription step as the most vulnerable link in the scam pipeline and introduce ASRJam, a proactive defence framework that injects adversarial perturbations into the victim's audio to disrupt the attacker's ASR. This breaks the scam's feedback loop without affecting human callers, who can still understand the conversation. Because prior adversarial audio techniques are often unpleasant and impractical for real-time use, we also propose EchoGuard, a novel jammer that leverages natural distortions, such as reverberation and echo, that are disruptive to ASR but tolerable to humans. To evaluate EchoGuard's effectiveness and usability, we conducted a 39-person user study comparing it with three state-of-the-art attacks. Results show that EchoGuard achieved the highest overall utility, offering the best combination of ASR disruption and human listening experience.

Large Language Models (LLMs) are now widely used across many applications, demonstrating impressive progress in understanding and generating natural language [1], [2], [3]. When combined with text-to-speech (TTS) and automatic speech recognition (ASR) technologies, LLMs enable powerful new capabilities such as automated customer service, outbound sales, cold calling, and advanced virtual assistants. However, as these systems become more realistic and lifelike, they also raise significant security concerns. LLMs have proven effective at generating phishing content that rivals human-written emails [4], [5], contributing to a 703% rise in credential phishing in 2024. The integration of LLMs with speech synthesis into real-time, automated scam agents is the inevitable next step [6].
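The echo-based distortion that EchoGuard builds on can be illustrated with a minimal sketch: mixing a delayed, attenuated copy of the signal back into itself. The delay and decay values below are illustrative assumptions, not parameters taken from the paper, and this is not the authors' implementation.

```python
import numpy as np

def add_echo(audio: np.ndarray, sample_rate: int = 16000,
             delay_s: float = 0.1, decay: float = 0.5) -> np.ndarray:
    """Mix a delayed, attenuated copy of the signal into itself.

    delay_s and decay are illustrative values chosen for this sketch;
    the paper's actual perturbation parameters are not reproduced here.
    """
    delay = int(delay_s * sample_rate)       # echo delay in samples
    out = audio.astype(np.float64).copy()
    out[delay:] += decay * audio[:-delay]    # superimpose the echoed copy
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out  # avoid clipping
```

A human listener perceives this kind of echo as room acoustics, while an ASR model sees overlapping copies of every phoneme, which degrades transcription.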
Voice agents operate by chaining together a sequence of neural networks to handle calls in real time: (1) ASR transcribes the victim's speech into text, (2) an LLM generates an appropriate textual response, and (3) TTS synthesizes that response into natural-sounding audio. This pipeline enables scalable voice interactions that can convincingly impersonate trusted entities and extract sensitive information from victims, as seen in Figure 1.
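The three-stage loop above can be sketched as follows. The `transcribe`, `respond`, and `synthesize` functions are hypothetical stand-ins (simple string stubs), not real ASR/LLM/TTS APIs; the point is the control flow that a jammer like ASRJam targets.

```python
def transcribe(audio: bytes) -> str:
    # (1) ASR stage: speech -> text. Stub: treat the bytes as UTF-8 text.
    return audio.decode("utf-8")

def respond(text: str) -> str:
    # (2) LLM stage: generate a textual reply. Stub reply for illustration.
    return f"Thanks for confirming: {text}"

def synthesize(text: str) -> bytes:
    # (3) TTS stage: text -> speech. Stub: re-encode the text as bytes.
    return text.encode("utf-8")

def handle_turn(victim_audio: bytes) -> bytes:
    """One conversational turn of the attacker's pipeline.

    ASRJam attacks step (1): if the transcription is garbled, the LLM in
    step (2) responds to nonsense and the scam's feedback loop breaks.
    """
    text = transcribe(victim_audio)
    reply = respond(text)
    return synthesize(reply)
```

Because the LLM only ever sees the ASR's output, corrupting that single link is enough to derail the entire automated conversation.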
Jun-16-2025