bypass
MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors
Tong, Xin, Lin, Zhi, Wang, Jingya, Han, Meng, Jin, Bo
Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier "refusal-direction" edits can bypass those layers, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLMs deployment in security-sensitive domains.
PLA: Prompt Learning Attack against Text-to-Image Generative Models
Lyu, Xinqi, Liu, Yihao, Li, Yanjie, Xiao, Bin
T ext-to-Image (T2I) models have gained widespread adoption across various applications. Despite the success, the potential misuse of T2I models poses significant risks of generating Not-Safe-F or-W ork (NSFW) content. T o investigate the vulnerability of T2I models, this paper delves into adversarial attacks to bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search adversarial prompts. Due to limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. T o facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework ( PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models including prompt filters and post-hoc safety checkers with a high success rate compared to state-of-the-art methods. W arning: This paper may contain offensive model-generated content.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Jailbreaking Large Language Models in Infinitely Many Ways
Goldstein, Oliver, La Malfa, Emanuele, Drinkall, Felix, Marro, Samuele, Wooldridge, Michael
We discuss the "Infinitely Many Meanings" attacks (IMM), a category of jailbreaks that leverages the increasing capabilities of a model to handle paraphrases and encoded communications to bypass their defensive mechanisms. IMMs' viability pairs and grows with a model's capabilities to handle and bind the semantics of simple mappings between tokens and work extremely well in practice, posing a concrete threat to the users of the most powerful LLMs in commerce. We show how one can bypass the safeguards of the most powerful open- and closed-source LLMs and generate content that explicitly violates their safety policies. One can protect against IMMs by improving the guardrails and making them scale with the LLMs' capabilities. For two categories of attacks that are straightforward to implement, i.e., bijection and encoding, we discuss two defensive strategies, one in token and the other in embedding space. We conclude with some research questions we believe should be prioritised to enhance the defensive mechanisms of LLMs and our understanding of their safety.
- Europe > France (0.05)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Germany (0.04)
- (3 more...)
How OpenAI stress-tests its large language models
The first paper describes how OpenAI directs an extensive network of human testers outside the company to vet the behavior of its models before they are released. The second paper presents a new way to automate parts of the testing process, using a large language model like GPT-4 to come up with novel ways to bypass its own guardrails. The aim is to combine these two approaches, with unwanted behaviors discovered by human testers handed off to an AI to be explored further and vice versa. Automated red-teaming can come up with a large number of different behaviors, but human testers bring more diverse perspectives into play, says Lama Ahmad, a researcher at OpenAI: "We are still thinking about the ways that they complement each other." AI companies have repurposed the approach from cybersecurity, where teams of people try to find vulnerabilities in large computer systems.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
Fox News AI Newsletter: 'Fargo' creator: 'We've got a fight on our hands'
"Fargo" series creator Noah Hawley spoke with Fox News Digital at the Emmys, and warned that while he doesn't think AI can replicate human creativity, it still poses a threat. Noah Hawley attends the premiere of FOX's "Lucy In The Sky" at Darryl Zanuck Theater at FOX Studios on Sept. 25, 2019, in Los Angeles. READY FOR BATTLE: "Fargo" series creator Noah Hawley is wary of the good and bad in artificial intelligence. AI OPTIMISM: A prominent Silicon Valley businessman and venture capitalist believes artificial intelligence can spur deflation and create enough growth to help those whose jobs will be lost to the technology. MEDICAL MIRACLE: A New York man who was left paralyzed after a diving accident is starting to regain movement a year after receiving an artificial intelligence-powered implant in his brain.
- North America > United States > New York (0.27)
- North America > United States > California > Los Angeles County > Los Angeles (0.27)
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Liu, Yi, Deng, Gelei, Xu, Zhengzi, Li, Yuekang, Zheng, Yaowen, Zhang, Ying, Zhao, Lida, Zhang, Tianwei, Liu, Yang
Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. Our study investigates three key research questions: (1) the number of different prompt types that can jailbreak LLMs, (2) the effectiveness of jailbreak prompts in circumventing LLM constraints, and (3) the resilience of ChatGPT against these jailbreak prompts. Initially, we develop a classification model to analyze the distribution of existing prompts, identifying ten distinct patterns and three categories of jailbreak prompts. Subsequently, we assess the jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a dataset of 3,120 jailbreak questions across eight prohibited scenarios. Finally, we evaluate the resistance of ChatGPT against jailbreak prompts, finding that the prompts can consistently evade the restrictions in 40 use-case scenarios. The study underscores the importance of prompt structures in jailbreaking LLMs and discusses the challenges of robust jailbreak prompt generation and prevention.
- Oceania > Australia > New South Wales (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- Asia > Singapore (0.04)
Does ChatGPT have a character limit? Here's how to bypass it
Follow-up on an incomplete response: If ChatGPT stops generating text abruptly, simply type "Continue" as a follow-up prompt. You can also specify the last sentence and ask the chatbot to continue where it left off. Write a more descriptive prompt: If ChatGPT generated too little text and didn't get to reach its character limit, you will need to modify your prompt. Simply specify the number of words you want it to write. An example would be "Write a 500-word essay on climate change".
Multi-Agent Path Finding Based on Subdimensional Expansion with Bypass
Multi-agent path finding (MAPF) is an active area in artificial intelligence, which has many real-world applications such as warehouse management, traffic control, robotics, etc. Recently, M* and its variants have greatly improved the ability to solve the MAPF problem. Although subdimensional expansion used in those approaches significantly decreases the dimensionality of the joint search space and reduces the branching factor, they do not make full use of the possible non-uniqueness of the optimal path of each agent. As a result, the updating of the collision sets may bring a large number of redundant computation. In this paper, the idea of bypass is introduced into subdimensional expansion to reduce the redundant computation. Specifically, we propose the BPM* algorithm, which is an implementation of subdimensional expansion with bypass in M*. In the experiments, we show that BPM* outperforms the state-of-the-art in solving several MAPF benchmark problems.
Machine learning the hard way: Watson's fatal misdiagnosis
Opinion It started in Jeopardy and ended in loss. IBM's flagship AI Watson Health has been sold to venture capitalists for an undisclosed sum thought to be around a billion dollars, or a quarter of what the division cost IBM in acquisitions alone since it was spun off in 2015. Not the first nor the last massively expensive tech biz cock-up, but isn't AI supposed to be the future? Isn't IBM supposed to be good at this? It all started so well.
Why is Cybersecurity Failing Against Ransomware?
Yes, security is hard – no one is ever 100 percent safe from the threats lurking out there. But how is it that time and time again, companies – big companies – are continuing to fall for ransomware attacks? Let's explore the main reasons why, starting with some basics before getting more in-depth: Two-factor authentication (2FA) is probably the easiest security improvement an organization can implement, and it's one of the most advocated-for solutions by infosec professionals. Despite this, we continue to see breaches like Colonial Pipeline occur because organizations have either failed to implement 2FA or have failed to *fully* implement it. Anything that requires a username and password to access should have 2FA enabled.
- North America > United States (0.48)
- Asia > Russia (0.29)
- Europe > Russia (0.14)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (0.48)
- Government > Military > Cyberwarfare (0.40)