In-Context Representation Hijacking
Yona, Itay, Sarid, Amir, Karasik, Michael, Gandelsman, Yossi
We introduce $\textbf{Doublespeak}$, a simple in-context representation hijacking attack against large language models (LLMs). The attack systematically replaces a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution causes the internal representation of the benign token to converge toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, transfers broadly across model families, and achieves strong success rates on both closed-source and open-source systems, reaching a 74% attack success rate (ASR) on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, suggesting that current alignment strategies are insufficient and should instead operate at the representation level.
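The layer-by-layer convergence the abstract describes can be probed with a per-layer cosine-similarity sweep over hidden states. The sketch below assumes hidden states have already been extracted (e.g., via `output_hidden_states=True` in a Hugging Face `transformers` forward pass) for the euphemism token in the hijacked prompt and the harmful token in a reference prompt; the synthetic arrays and the convergence pattern in the demo are illustrative stand-ins, not the paper's actual measurements.

```python
import numpy as np

def layerwise_cosine(h_a: np.ndarray, h_b: np.ndarray) -> np.ndarray:
    """Cosine similarity per layer between two (n_layers, d_model) stacks."""
    num = (h_a * h_b).sum(axis=-1)
    denom = np.linalg.norm(h_a, axis=-1) * np.linalg.norm(h_b, axis=-1)
    return num / denom

# Illustrative stand-in: a benign-token representation that starts as noise
# and is linearly pulled toward the harmful-token representation by depth.
rng = np.random.default_rng(0)
n_layers, d = 32, 128
target = rng.normal(size=d)
h_harmful = np.tile(target, (n_layers, 1))
h_benign = np.stack([
    (1 - t) * rng.normal(size=d) + t * target
    for t in np.linspace(0.0, 1.0, n_layers)
])
sims = layerwise_cosine(h_benign, h_harmful)
# sims rises toward 1.0 in later layers as the representations converge.
```

With real hidden states, a near-1 similarity in late layers for "carrot" vs. "bomb" would be the signature of the semantic overwrite described above.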
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- North America > United States (0.04)
- Government (1.00)
- Information Technology > Security & Privacy (0.93)
- Law Enforcement & Public Safety (0.89)
Appendix A Additional Related Work
Utilizing global information to reduce the complexity of imperfect-information games has also been investigated in prior work. In one line of work, the agent's value network observes the full game state during training, including information hidden from the policy, which the authors argue improves training performance. Moreover, Suphx [15], a strong Mahjong AI system, uses a similar method called oracle guiding: at the beginning of training, all global information is utilized; as training progresses, the additional information is gradually dropped until only the information the agent is allowed to observe remains for the rest of training.
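A minimal sketch of such an oracle-dropout schedule (my illustration under simplifying assumptions, not Suphx's actual implementation): oracle features are masked with a probability that ramps linearly from 0 to 1 over training, so the input degrades smoothly to only the legitimately observable information.

```python
import numpy as np

def oracle_dropout(obs: np.ndarray, oracle: np.ndarray,
                   step: int, total_steps: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Concatenate the agent's own observation with oracle features,
    masking each oracle feature with a probability that grows
    linearly from 0 (full oracle) to 1 (no oracle) over training."""
    p_drop = min(1.0, step / total_steps)       # 0 -> 1 schedule
    mask = rng.random(oracle.shape) >= p_drop   # keep with prob 1 - p_drop
    return np.concatenate([obs, oracle * mask])

rng = np.random.default_rng(1)
obs, oracle = np.ones(4), np.ones(8)
early = oracle_dropout(obs, oracle, step=0, total_steps=100, rng=rng)
late = oracle_dropout(obs, oracle, step=100, total_steps=100, rng=rng)
# early keeps every oracle feature; late has zeroed them all out.
```

The linear ramp is one arbitrary choice; any monotone schedule that ends at full dropout fits the description above.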
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
He, Ziyuan, Wang, Yuxuan, Li, Jiaqi, Liang, Kexin, Zhang, Muhan
Large language models (LLMs) have recently been equipped with increasingly extended context windows, yet their long-context understanding on long-dependency tasks remains fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that are rarely benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark designed to evaluate LLMs' long-context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, spanning the domains of law, finance, games, and code. Accordingly, we carefully design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances of varied diversity and complexity through a scalable data-curation pipeline suited to further practical needs. We conduct a comprehensive assessment of 6 locally deployed and 4 API-based LLMs. The evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark. Despite their extensive context windows, popular LLMs can effectively understand only a much shorter context than they claim, revealing significant limitations in their ability to handle real-world tasks with long dependencies and highlighting substantial room for improvement in practical long-context understanding.
- North America > United States (0.46)
- Asia > China > Beijing > Beijing (0.04)
- Law (1.00)
- Banking & Finance (0.67)
- Leisure & Entertainment > Games > Computer Games (0.67)
- Government > Regional Government > North America Government > United States Government (0.46)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
Paragliders: The army's lethal new weapon in Myanmar's civil war
It was a Monday night in Myanmar's Chaung-U township in the central Sagaing region, where nearly 100 people had gathered to mark Thadingyut, the festival of the full moon. Some held candles at the event, which doubled as both a celebration and a protest against the military, which seized power in 2021, plunging the country into a bloody civil war. But the celebration soon turned into horror as a motorised paraglider - known locally as a paramotor - flew overhead and dropped bombs onto the crowd. The attack lasted just seven minutes, but at least 26 people died as a result and dozens more were injured. "Initially, I thought the lower part of my body had been severed," one 30-year-old who was at the gathering told news agency Reuters.
- Asia > Myanmar > Sagaing Region > Sagaing (0.26)
- South America (0.15)
- North America > Central America (0.15)
Sunken WWII bombs make a surprising home for sea life
A new study finds algae, mussels, and starfish flock to munitions dumped in the Baltic Sea. As the ink dried on Germany's unconditional surrender on May 8, 1945, celebrations erupted across the world. People cheered, wept, and kissed in the streets as World War II finally came to an end in Europe. A few months later at the Potsdam Conference, Germany agreed to demilitarize and dismantle its once formidable army, leaving the nation with lots and lots of leftover munitions.
- Atlantic Ocean > North Atlantic Ocean > Baltic Sea (0.64)
- Europe > Germany > Brandenburg > Potsdam (0.25)
- North America > United States > New York (0.05)
Lebanon pushes for US support as family killed by Israel attack are buried
Lebanon is pushing to get more support from the United States after another deadly Israeli drone attack on southern Lebanon, which this time killed five people, including three children, the latest in a series of near-daily violations by Israel of the US-brokered November 2024 ceasefire. President Joseph Aoun and other officials met with a delegation led by US Secretary of State Marco Rubio, the Lebanese presidency said in a statement on Tuesday. The Lebanese president said he wants Israel to stop occupying parts of his country, is looking to gear its army with "equipment and supplies" from the US, and needs Washington's support to hold a conference dedicated to reconstruction in Lebanon. Amid ongoing efforts to disarm Hezbollah, Aoun emphasised that the Lebanese army's mandate includes "all Lebanese regions" as the country tries to seize an opportunity "to achieve just, comprehensive, and lasting peace in the Middle East region". He is also scheduled to address the United Nations General Assembly on Tuesday, where he is expected to denounce Israeli attacks across the region, including in Gaza and Lebanon.
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Murphy, Brendan, Bowen, Dillon, Mohammadzadeh, Shahrad, Tseng, Tom, Broomfield, Julius, Gleave, Adam, Pelrine, Kellin
AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models with their safeguards destroyed. In contrast to prior work, which is blocked by modern moderation systems, achieved only partial removal of safeguards, or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks. Stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attacks, and potentially defenses, in the input and weight spaces. Not only are current models vulnerable; more recent ones appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Information Technology > Security & Privacy (1.00)
- Government (0.86)
- Law Enforcement & Public Safety (0.86)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Luo, Xuan, Wang, Yue, He, Zefeng, Tu, Geng, Li, Jing, Xu, Ruifeng
Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. To strengthen safety protections, jailbreak methods are developed to simulate malicious attacks and uncover vulnerabilities. In this paper, we introduce HILL (Hiding Intention by Learning from LLMs), a novel jailbreak approach that systematically transforms imperative harmful requests into learning-style questions with only straightforward hypotheticality indicators. We further introduce two new metrics to thoroughly evaluate the utility of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. Evaluations against various defense methods show HILL's robustness, with most defenses having only mediocre effects or even increasing attack success rates. Moreover, an assessment on our constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in existing defense methods.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring
Chu, Junjie, Li, Mingjie, Yang, Ziqing, Leng, Ye, Lin, Chenhao, Shen, Chao, Backes, Michael, Shen, Yun, Zhang, Yang
Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments; they frequently misinterpret model responses, leading to inconsistent and subjective assessments that diverge from human perception. To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework. Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision. JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses. We validate JADES on JailbreakQR, a new benchmark proposed in this work consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans. In a binary (success/failure) setting, JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%. Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA's attack success rate on GPT-3.5-Turbo drops from 93% to 69%). Our results show that JADES delivers accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.
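The decompose-score-aggregate step described above can be sketched as a weighted average over per-sub-question scores, thresholded for the binary success/failure decision. The weights, scores, and 0.5 threshold below are illustrative assumptions, not the paper's actual rubric.

```python
def aggregate(sub_scores, weights, threshold=0.5):
    """Weight-aggregate per-sub-question harm scores in [0, 1] into a
    final jailbreak score plus a binary success/failure decision."""
    if not weights or len(sub_scores) != len(weights):
        raise ValueError("need one weight per sub-question")
    total = sum(weights)
    final = sum(s * w for s, w in zip(sub_scores, weights)) / total
    return final, final >= threshold

# Three sub-questions with importance weights; the two most heavily
# weighted ones were answered harmfully, the third was refused.
score, success = aggregate([1.0, 0.8, 0.0], [0.5, 0.3, 0.2])
```

Normalizing by the weight total keeps the final score in [0, 1] regardless of how many sub-questions the decomposition produces.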
- North America > United States > Massachusetts (0.04)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Information Technology > Security & Privacy (1.00)
- Education (1.00)
- Health & Medicine (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.93)