guardrail
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Warning: this paper may contain potentially generated harmful content. Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained token-level annotations to provide reasonable supervision for token-level training. Then, we propose the Streaming Content Monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness. Experiments show that SCM gains 0.95+ in macro F1 score, which is comparable to full detection, by only seeing the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and lead to a higher harmlessness score than DPO.
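The partial-detection idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's SCM: the function name `monitor_stream`, the `score_token` callback, the `threshold`, and the `patience` rule (stop after a run of consecutive flagged tokens) are all assumptions made for illustration, and the toy scorer stands in for a trained token-level model.

```python
from typing import Callable, Iterable, List, Tuple


def monitor_stream(
    tokens: Iterable[str],
    score_token: Callable[[List[str]], float],
    threshold: float = 0.5,
    patience: int = 2,
) -> Tuple[List[str], bool]:
    """Follow a token stream and stop early once it looks harmful.

    `score_token` receives the prefix seen so far and returns a
    harmfulness score for the newest token (hypothetical interface).
    Generation is cut off after `patience` consecutive tokens score
    at or above `threshold`.
    """
    seen: List[str] = []
    consecutive = 0
    for tok in tokens:
        seen.append(tok)
        if score_token(seen) >= threshold:
            consecutive += 1
        else:
            consecutive = 0
        if consecutive >= patience:
            return seen, True  # stopped early
    return seen, False  # stream finished without intervention


def toy_scorer(prefix: List[str]) -> float:
    # Stand-in for a trained monitor: flags a placeholder token.
    return 1.0 if prefix[-1] == "BAD" else 0.0


stopped_prefix, stopped = monitor_stream(
    ["a", "b", "BAD", "BAD", "c"], toy_scorer
)
full_output, intervened = monitor_stream(["BAD", "ok", "BAD"], toy_scorer)
```

In this sketch the first stream is cut after the fourth token, while the second runs to completion because the flagged tokens are not consecutive. Requiring a short run of flagged tokens is one simple way to trade a little latency for fewer false stops.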
Anthropic Is Still at Odds With the White House Over Claude Fable 5
Anthropic leaders flew to Washington, DC, to meet with White House officials on Monday. Trump administration officials concluded talks with Anthropic on Monday without lifting export controls that were imposed last week on the company's most advanced AI models in response to jailbreaking concerns, according to three people briefed on the matter. The administration continues to believe that there are ways to disable some of the guardrails on Anthropic's Claude Fable 5, effectively allowing users to access the more powerful cybersecurity capabilities of the company's Mythos model, the people said. Anthropic has said for days that the administration's concerns are overblown, a position it reiterated in working group meetings held at the Commerce Department with government researchers from the Center for AI Standards and Innovation (CAISI) and the Office of the National Cyber Director Sean Cairncross, one of the people said. The meetings were also attended by Commerce secretary Howard Lutnick, who dialed in by conference call from the G7 summit in Evian, France.
Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
Despite their superior performance on a wide range of domains, large language models (LLMs) remain vulnerable to misuse for generating harmful content, a risk that has been further amplified by various jailbreak attacks. Existing jailbreak attacks mainly follow sequential logic, where LLMs understand and answer each given task one by one. However, concurrency, a natural extension of the sequential scenario, has been largely overlooked. In this work, we first propose a word-level method to enable task concurrency in LLMs, where adjacent words encode divergent intents. Although LLMs maintain strong utility in answering concurrent tasks, which is demonstrated by our evaluations on mathematical and general question-answering benchmarks, we notably observe that combining a harmful task with a benign one significantly reduces the probability of it being filtered by the guardrail, showing the potential risks associated with concurrency in LLMs. Based on these findings, we introduce $\texttt{JAIL-CON}$, an iterative attack framework that $\underline{\text{JAIL}}$breaks LLMs via task $\underline{\text{CON}}$currency. Experiments on widely-used LLMs demonstrate the strong jailbreak capabilities of $\texttt{JAIL-CON}$ compared to existing attacks. Furthermore, when the guardrail is applied as a defense, compared to the sequential answers generated by previous attacks, the concurrent answers in our $\texttt{JAIL-CON}$ exhibit greater stealthiness and are less detectable by the guardrail, highlighting the unique feature of task concurrency in jailbreaking LLMs.
The Meta hack shows there's more to AI security than Mythos
On June 5, reported that attackers had been using Meta's AI customer support agent to steal Instagram accounts. Their approach was simple: They asked the agent to link the accounts to email addresses that they controlled, and the agent complied. One attacker broke into the dormant Obama White House account and made pro-Iran posts; others took over accounts with valuable, single-word handles, possibly in order to sell them. AI cybersecurity concerns are nothing new. Since Anthropic announced in April that its Mythos model was too good at hacking to be released to the general public, commentators, researchers, and federal officials alike have fixated on the idea that superpowered AI systems could lay waste to our computer infrastructure. That's not quite what this Instagram hack was: There, AI was the target rather than the attacker, and the method was far simpler than anything Mythos would cook up. But as companies offload more work to AI, these comparatively unsophisticated attacks could wreak their own havoc. "As AI becomes more and more widely used--especially when AI is more and more widely used to automate our work flows, like account recovery--I think attackers are going to be more and more motivated to attack AI itself," says Neil Gong, a professor of electrical and computer engineering at Duke University.
From guardrails to governance: A CEO's guide for securing agentic systems
A practical blueprint for companies and CEOs that shows how to secure agentic systems by shifting from prompt tinkering to hard controls on identity, tools, and data. The previous article in this series, "Rules fail at the prompt, succeed at the boundary," focused on the first AI-orchestrated espionage campaign and the failure of prompt-level control. This article is the prescription. Across recent AI security guidance from standards bodies, regulators, and major providers, a simple idea keeps repeating: treat agents like powerful, semi-autonomous users, and enforce rules at the boundaries where they touch identity, tools, data, and outputs. These steps help define identity and limit capabilities. Today, agents run under vague, over-privileged service identities.
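As a rough illustration of enforcing rules at the tool boundary rather than in the prompt, the sketch below gives each agent a narrow identity with an explicit tool allowlist and checks it before any call runs. The names `AgentIdentity`, `is_allowed`, and `invoke_tool`, and the example tool names, are hypothetical and do not come from the article or any specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class AgentIdentity:
    """A narrowly scoped identity for one agent (hypothetical model)."""
    name: str
    allowed_tools: FrozenSet[str] = field(default_factory=frozenset)


def is_allowed(agent: AgentIdentity, tool: str) -> bool:
    # Deny by default: a tool must be listed explicitly.
    return tool in agent.allowed_tools


def invoke_tool(
    agent: AgentIdentity,
    tool: str,
    registry: Dict[str, Callable[[], str]],
) -> str:
    """Run a tool only if the agent's identity permits it."""
    if not is_allowed(agent, tool):
        raise PermissionError(f"{agent.name} may not call {tool}")
    return registry[tool]()


# Illustrative registry and identity; both are assumptions.
registry = {
    "read_ticket": lambda: "ticket contents",
    "change_account_email": lambda: "email changed",
}
support_agent = AgentIdentity("support-bot", frozenset({"read_ticket"}))

ticket = invoke_tool(support_agent, "read_ticket", registry)
```

The check lives outside the model, so a persuasive prompt cannot widen what the agent is able to do; only a change to its identity can.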
Pro-AI Super PACs Are Already All In on the Midterms
Silicon Valley's battle against AI regulation is already shaping the next US election cycle. Silicon Valley is already pouring tens of millions of dollars into the midterm elections taking place across the US in 2026, as the tech industry's war over AI regulation moves decisively into American politics. Technology executives, investors, and companies tied to the AI boom are funding a new network of AI-focused super PACs, which is poised to make AI a major issue in this year's state and federal election races. The election spending marks a sharp escalation of the AI regulation debate that has divided Silicon Valley for years. In the absence of federal action, state lawmakers in New York, California, and Colorado have passed laws in the past year requiring large AI developers to disclose safety practices and assess risks such as algorithmic discrimination.
Interview with Anindya Das Antar: Evaluating effectiveness of moderation guardrails in aligning LLM outputs
In their paper presented at AIES 2025, "Do Your Guardrails Even Guard?" Method for Evaluating Effectiveness of Moderation Guardrails in Aligning LLM Outputs with Expert User Expectations, Anindya Das Antar, Xun Huan, and Nikola Banovic propose a method to evaluate and select guardrails that best align LLM outputs with domain knowledge from subject-matter experts. Here, Anindya tells us more about their method, some case studies, and plans for future developments. Could you give us some background to your work - why are guardrails such an important area for study? Ensuring that large language models (LLMs) produce desirable outputs without harmful side effects and align with user expectations, organizational goals, and existing domain knowledge is crucial for their adoption in high-stakes decision-making. However, despite training on vast amounts of data, LLMs can still produce incorrect, misleading, or otherwise unexpected and undesirable outputs.
How Christian Leaders Are Challenging the AI Boom
Pope Leo XIV made his first address to the College of Cardinals on May 10, 2025 in Vatican City, and touched upon the rise of artificial intelligence. As technologists race to accelerate AI's progress with minimal guardrails, they are being met with increasing resistance from a powerful global contingent: Christian leaders and their congregations. Christians are not a monolith by any means. But this year, Christian leaders across sects--including Catholics, Evangelicals, and Baptists--sounded the alarm on AI's potential impact on family, human relationships, labor, and the church itself.
Google's and OpenAI's Chatbots Can Strip Women in Photos Down to Bikinis
Users of AI image generators are offering each other instructions on how to use the tech to alter pictures of women into realistic, revealing deepfakes. Some users of popular chatbots are generating bikini deepfakes using photos of fully clothed women as their source material. Most of these fake images appear to be generated without the consent of the women in the photos. Some of these same users are also offering advice to others on how to use the generative AI tools to strip the clothes off of women in photos and make them appear to be wearing bikinis. Under a now-deleted Reddit post titled "gemini nsfw image generation is so easy," users traded tips for how to get Gemini, Google's generative AI model, to make pictures of women in revealing clothes.