spread malware
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Ji, Jiabao, Hou, Bairu, Robey, Alexander, Pappas, George J., Hassani, Hamed, Zhang, Yang, Wong, Eric, Chang, Shiyu
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction following benchmarks such as InstructionFollowing and AlpacaEval. The codes will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Li, Xuan, Zhou, Zhanke, Zhu, Jianing, Yao, Jiangchao, Liu, Tongliang, Han, Bo
Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment w.r.t. the authority power for inciting harmfulness, we disclose a lightweight method, termed DeepInception, which can easily hypnotize LLM to be a jailbreaker. Specifically, DeepInception leverages the personification ability of LLM to construct a novel nested scene to behave, which realizes an adaptive way to escape the usage control in a normal scenario. Empirically, our DeepInception can achieve competitive jailbreak success rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open and closed-source LLMs like Falcon, Vicuna-v1.5, Llama-2, and GPT-3.5-turbo/4. Our investigation appeals to people to pay more attention to the safety aspects of LLMs and develop a stronger defense against their misuse risks. The code is publicly available at: https://github.com/tmlr-group/DeepInception.
How ChatGPT--and Bots Like It--Can Spread Malware
The AI landscape has started to move very, very fast: consumer-facing tools such as Midjourney and ChatGPT are now able to produce incredible image and text results in seconds based on natural language prompts, and we're seeing them get deployed everywhere from web search to children's books. However, these AI applications are being turned to more nefarious uses, including spreading malware. Take the traditional scam email, for example: It's usually littered with obvious mistakes in its grammar and spelling--mistakes that the latest group of AI models don't make, as noted in a recent advisory report from Europol. Think about it: A lot of phishing attacks and other security threats rely on social engineering, duping users into revealing passwords, financial information, or other sensitive data. The persuasive, authentic-sounding text required for these scams can now be pumped out quite easily, with no human effort required, and endlessly tweaked and refined for specific audiences.