 safeguard


ChatGPT can be made to generate sexualised and violent images, researchers find

BBC News

The latest public version of ChatGPT can be made to generate sexualised images or depict scenes of graphic violence with a simple prompt, researchers have told the BBC. British AI security startup Mindgard figured out how to make ChatGPT create graphic pictures by slightly altering a widely-shared instruction, or prompt, which was originally designed to produce humorous results. After being contacted by the BBC, ChatGPT's maker OpenAI said it had taken action to stop the chatbot responding with those types of images. "After investigating this trend, we've introduced additional safeguards against this type of prompt," it said in a statement. It also said it has multiple layers of protection to prevent users making content which breaches its terms and conditions.


RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Neural Information Processing Systems

Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models--designed to monitor LLM inputs and outputs and block potentially harmful content--has emerged as a prevalent mitigation strategy. Existing approaches to training guard models rely heavily on extensive human-curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: (1) guided reasoning, where it analyzes safety risks of input content through policy-guided step-by-step reasoning, and (2) reinforced alignment, where rule-based RL optimizes its reasoning paths to align with accurate safety prediction.


SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Neural Information Processing Systems

By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal attacks, often exhibiting overdefensive behaviors and imposing heavy training overhead.


Anthropic Walks Back Policy That Could Have 'Sabotaged' AI Researchers Using Claude

WIRED

The company changed course after researchers spoke out against the policy, which would have covertly limited Claude's ability to develop competing AI models. Anthropic is backtracking on a policy that would have covertly limited competitors from using its new AI model, Claude Fable 5, to develop other AI models. The company changed course after the move received significant backlash from the AI research community. "We're changing Fable 5's safeguards for frontier LLM development to make them visible," Anthropic said in a statement to WIRED. "We made the wrong tradeoff and we apologize for not getting the balance right."


Anthropic's Fable AI brings the capabilities of its unreleased Mythos model to regular users

Engadget

Claude subscribers can try the model until June 22 without spending usage credits. Anthropic has just announced Fable, the start of a new family of models that brings many of the capabilities of its Mythos system to the public. As a refresher, Mythos is the state-of-the-art model Anthropic debuted at the start of April through Project Glasswing. The project saw Anthropic share access to the model with select partners, including Apple and NVIDIA, with the aim of helping those organizations harden their software against AI cyberattacks. Glasswing also prompted the White House to rethink its policy on AI regulation.


Anthropic Offers Mythos Upgrade for Cyber Partners and a 'Safe' Version for the Rest of You

WIRED

Anthropic is releasing Claude Mythos 5 to trusted organizations and Claude Fable 5 to the public, a version it says can't be used for cyberattacks. Anthropic released two new AI models called Claude Fable 5 and Claude Mythos 5 on Tuesday, which the company says have greater capabilities than the Mythos Preview model it released in April to a limited set of tech industry partners. Anthropic has said the initial, limited release stemmed from concerns that the model's capabilities could be exploited by bad actors to develop hacking tools that could catch defenders off guard. Anthropic is currently only releasing Claude Mythos 5 to a limited set of industry partners, many of which received access to Mythos Preview, and the company says it is collaborating with the US government on the rollout. Claude Fable 5, which is being publicly released, uses the same underlying model as Mythos 5, but will have "guardrails" in place at launch, the company said Tuesday, that will block the model from answering many user questions related to cybersecurity, biology, and chemistry.


Tennessee minors sue Musk's xAI, alleging Grok generated sexual images of them

The Japan Times

Governments and regulators around the world have launched probes into xAI, imposed bans and demanded safeguards in a growing push to curb illegal and offensive material. Three Tennessee plaintiffs, including two minors, sued Elon Musk's xAI on Monday, alleging that it knowingly designed its Grok image generator to let people create sexually explicit content by using real photos of others. The lawsuit, filed in the San Jose, California federal court, is seeking class-action status for people in the United States who were reasonably identifiable in sexualized images or videos generated by Grok based on real images of themselves. The artificial intelligence company did not immediately respond to a request for comment. After an outcry over sexually explicit content generated by the chatbot, xAI said in January that it had blocked all users from editing images of real people in revealing clothing and from generating images of people in revealing clothing in jurisdictions where it's illegal. Governments and regulators around the world have also since launched probes, imposed bans and demanded safeguards in a growing push to curb illegal and offensive material.


Mind launches inquiry into AI and mental health after Guardian investigation

The Guardian

The Guardian revealed how people were being put at risk of harm by false and misleading health information in Google AI Overviews. Exclusive: England and Wales charity to examine safeguards after Guardian exposed 'very dangerous' advice on Google AI Overviews. Mind is launching a significant inquiry into artificial intelligence and mental health after a Guardian investigation exposed how Google's AI Overviews gave people "very dangerous" medical advice. In a year-long commission, the mental health charity, which operates in England and Wales, will examine the risks and safeguards required as AI increasingly influences the lives of millions of people affected by mental health issues worldwide. The inquiry - the first of its kind globally - will bring together the world's leading doctors and mental health professionals, as well as people with lived experience, health providers, policymakers and tech companies.