GUARD: Constructing Realistic Two-Player Matrix and Security Games for Benchmarking Game-Theoretic Algorithms
Game-theoretic algorithms are commonly benchmarked on recreational games, classical constructs from economic theory such as congestion and dispersion games, or entirely random game instances. While the past two decades have seen the rise of security games, which are grounded in real-world scenarios such as patrolling and infrastructure protection, their practical evaluation has been hindered by limited access to the datasets used to generate them. In particular, although the structural components of these games (e.g., patrol paths derived from maps) can be replicated, the critical data defining target values, which are central to utility modeling, remain inaccessible. In this paper, we introduce a flexible framework that leverages open-access datasets to generate realistic matrix and security game instances. These include animal movement data for modeling anti-poaching scenarios, and demographic and infrastructure data for modeling infrastructure protection. Our framework allows users to customize utility functions and game parameters, and it also offers a suite of preconfigured instances. We provide theoretical results highlighting the degeneracy and limitations of benchmarking on random games, and we empirically compare our generated games against random baselines across a variety of standard algorithms for computing Nash and Stackelberg equilibria, including linear programming, incremental strategy generation, and self-play with no-regret learners.
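The abstract lists self-play with no-regret learners among the equilibrium solvers. The sketch below shows one such learner, regret matching, playing against itself on a two-player zero-sum matrix game. Regret matching is our illustrative choice; the framework may use different learners. The payoff matrix is a hypothetical toy example, not a generated GUARD instance. In zero-sum games, the time-averaged strategies of two no-regret learners approach a Nash equilibrium, and `duality_gap` (a name we introduce here) measures how far they still are from one.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def _regret_matching(regrets: List[float]) -> List[float]:
    """Play proportionally to positive cumulative regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [r / total for r in positive]


def self_play(payoff: Matrix, iterations: int) -> Tuple[List[float], List[float]]:
    """Self-play of two regret-matching learners on a zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives its negation.
    Returns the time-averaged mixed strategies of both players.
    """
    m, n = len(payoff), len(payoff[0])
    row_regret, col_regret = [0.0] * m, [0.0] * n
    row_sum, col_sum = [0.0] * m, [0.0] * n
    for _ in range(iterations):
        x = _regret_matching(row_regret)
        y = _regret_matching(col_regret)
        # Expected payoff of each pure action against the opponent's current mix.
        row_util = [sum(payoff[i][j] * y[j] for j in range(n)) for i in range(m)]
        col_util = [-sum(payoff[i][j] * x[i] for i in range(m)) for j in range(n)]
        row_value = sum(x[i] * row_util[i] for i in range(m))
        col_value = sum(y[j] * col_util[j] for j in range(n))
        # Accumulate regrets and strategy sums for the time average.
        for i in range(m):
            row_regret[i] += row_util[i] - row_value
            row_sum[i] += x[i]
        for j in range(n):
            col_regret[j] += col_util[j] - col_value
            col_sum[j] += y[j]
    return [s / iterations for s in row_sum], [s / iterations for s in col_sum]


def duality_gap(payoff: Matrix, x: List[float], y: List[float]) -> float:
    """max_i u(i, y) - min_j u(x, j); zero exactly at a Nash equilibrium."""
    m, n = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(n)) for i in range(m))
    best_col = min(sum(payoff[i][j] * x[i] for i in range(m)) for j in range(n))
    return best_row - best_col


if __name__ == "__main__":
    game = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical 2x2 zero-sum game
    x_avg, y_avg = self_play(game, 20000)
    print(x_avg, y_avg, duality_gap(game, x_avg, y_avg))
```

For this toy game, both players' equilibrium strategies put probability 0.4 on their first action. The averaged strategies should approach that value as the number of iterations grows.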
Lifelong Safety Alignment for Language Models
LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first use the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR [80] and a 57% transfer ASR on LAT [53] using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments.
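The sketch below shows how the attack success rate (ASR) is computed and how an attacker/defender alternation produces a per-round ASR curve. It is a deliberately trivial toy. Both LLMs are replaced by hypothetical rules: the "attacker" invents placeholder strategy names, and the "defender" simply memorizes strategies that got through. It is not the paper's training procedure. All names (`toy_lifelong_rounds`, `new_per_round`) are illustrative assumptions.

```python
from typing import List, Set


def attack_success_rate(outcomes: List[bool]) -> float:
    """ASR: fraction of attack attempts that succeed (True = jailbreak succeeded)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def toy_lifelong_rounds(rounds: int, new_per_round: int) -> List[float]:
    """Toy alternation between a hypothetical attacker and defender.

    Each round the attacker replays every strategy found so far and adds
    `new_per_round` novel ones; the defender blocks only strategies it has
    already been trained on, then trains on every attack that succeeded.
    """
    discovered: List[str] = []
    defended: Set[str] = set()
    history: List[float] = []
    for r in range(rounds):
        # Attacker step: placeholder names stand in for newly discovered jailbreaks.
        discovered += [f"strategy-{r}-{k}" for k in range(new_per_round)]
        outcomes = [s not in defended for s in discovered]
        history.append(attack_success_rate(outcomes))
        # Defender step: "train" by memorizing every attack that got through.
        defended.update(s for s, ok in zip(discovered, outcomes) if ok)
    return history


if __name__ == "__main__":
    print(toy_lifelong_rounds(rounds=3, new_per_round=2))
```

In this toy, the ASR falls over rounds but never reaches zero, because a defender that only memorizes known attacks cannot stop novel ones. This is the gap between known and unseen attacks that the abstract highlights.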
NBA needs to incorporate 'mistaken identity' rule from FIFA World Cup to stop the flopping issue
OutKick Analysis
The World Cup's use of the rule offers a blueprint for real-time consequences.
Flopping is a major issue in the NBA. I've written about it ad nauseam.
The league has anti-flopping measures in place, but it rarely hands out fines based on postgame reviews, and in-game flopping calls are even more of a rarity.
Detection Framework for Inference-Stage Backdoor Defenses
Backdoor attacks involve inserting poisoned samples during training. This produces a model with a hidden backdoor that triggers specific behaviors while leaving performance on normal samples unaffected. These attacks are challenging to detect because the backdoored model appears normal until the backdoor trigger activates it, which makes them particularly stealthy. In this study, we devise a unified inference-stage detection framework to defend against backdoor attacks. We first rigorously formulate the inference-stage backdoor detection problem, which encompasses various existing methods, and discuss several challenges and limitations. We then propose a framework with provable guarantees on the false positive rate, i.e., the probability of misclassifying a clean sample. Further, we derive the most powerful detection rule under classical learning scenarios. This rule maximizes the detection power, namely the rate of correctly identifying a backdoor sample, at a given false positive rate.
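A standard way to obtain a provable false-positive-rate guarantee at inference time is a split-conformal threshold. You calibrate on detector scores from known-clean samples, then flag any test input whose score exceeds a chosen order statistic of those calibration scores. The sketch below assumes that some per-sample detector score already exists (higher means more suspicious). It also assumes that clean calibration scores and clean test scores are exchangeable. It illustrates the guarantee and is not necessarily the paper's exact construction. The function names are hypothetical.

```python
import math
from typing import Sequence


def conformal_threshold(clean_scores: Sequence[float], alpha: float) -> float:
    """Return t such that P(clean test score > t) <= alpha under exchangeability.

    t is the k-th smallest calibration score, with k = ceil((n + 1) * (1 - alpha)).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    n = len(clean_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few clean calibration samples: only flagging nothing keeps the guarantee.
        return math.inf
    return sorted(clean_scores)[k - 1]


def flag_backdoor(score: float, threshold: float) -> bool:
    """Flag an input as backdoored when its score exceeds the calibrated threshold."""
    return score > threshold


if __name__ == "__main__":
    calibration = [float(s) for s in range(1, 20)]  # hypothetical clean-sample scores
    t = conformal_threshold(calibration, alpha=0.25)
    print(t, flag_backdoor(16.0, t), flag_backdoor(15.0, t))
```

The threshold only controls false positives; detection power depends entirely on the score. By the Neyman-Pearson lemma, if the clean and backdoor score distributions were known, thresholding their likelihood ratio would give the most powerful test at a fixed false positive rate.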
Discussion of Evaluation Methodologies
Previous research has debated textual backdoor evaluation at length, covering diverse metrics and experimental settings. These valuable discussions motivated us to construct a rigorous benchmark, and we greatly appreciate these efforts. In this section, we briefly summarize existing opinions and discuss this topic in more detail. Table 9 summarizes the attackers that OpenBackdoor implements.
Effectiveness
Besides the mainstream ASR (also called LFR [20]) and CACC metrics, there are other effectiveness metrics. Shen et al. [46] proposed counting the number of inserted triggers that successfully flip the label. However, although inserting more triggers can increase attack strength, the triggers also gradually corrupt the sentences. The poisoned samples may therefore become "adversarial", and the two effects are hard to distinguish. Shen et al. [45] also noted this issue and recommended using the ASR difference between a poisoned model and a clean model as an effectiveness metric.
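The effectiveness metrics above reduce to simple ratios. The minimal sketch below assumes integer class predictions. It also assumes that the poisoned test set contains only samples whose true label differs from the target label, a common convention that is not stated above. The function names are illustrative and are not OpenBackdoor's API.

```python
from typing import Sequence


def attack_success_rate(preds: Sequence[int], target_label: int) -> float:
    """ASR (a.k.a. LFR): share of poisoned test inputs predicted as the target label."""
    if not preds:
        raise ValueError("empty prediction list")
    return sum(p == target_label for p in preds) / len(preds)


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """CACC: accuracy of a model on the clean (trigger-free) test set."""
    if not preds or len(preds) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def asr_difference(backdoored_preds: Sequence[int],
                   clean_model_preds: Sequence[int],
                   target_label: int) -> float:
    """ASR of the backdoored model minus ASR of a clean model on the same poisoned inputs.

    Subtracting the clean model's ASR discounts label flips caused by triggers
    corrupting the sentence (the "adversarial" effect) rather than by the backdoor.
    """
    return (attack_success_rate(backdoored_preds, target_label)
            - attack_success_rate(clean_model_preds, target_label))
```

For example, if a backdoored model maps 3 of 4 poisoned inputs to the target label, the clean model's ASR on those same inputs is subtracted. Any flips the trigger words cause by themselves then do not count toward the attack's effectiveness.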
Self-playing Adversarial Language Game Enhances LLM Reasoning
We explore the potential of self-play training for large language models (LLMs) in a two-player adversarial language game called Adversarial Taboo. In this game, an attacker and a defender communicate around a target word that only the attacker can see. The attacker aims to induce the defender to say the target word unknowingly, while the defender tries to infer the target word from the attacker's utterances. To win the game, both players need sufficient knowledge about the target word and high-level reasoning ability to make inferences and express themselves in this information-reserved conversation. Hence, we investigate whether LLMs' reasoning ability can be further enhanced by Self-Playing this Adversarial language Game (SPAG). To this end, we select several open-source LLMs and let each act as the attacker and play against a copy of itself as the defender on an extensive range of target words. Through reinforcement learning on the game outcomes, we observe that the LLMs' performance improves uniformly across a broad range of reasoning benchmarks. Furthermore, iterating this self-play process continues to improve LLMs' reasoning abilities. The code is available at https://github.com/Linear95/SPAG.
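Reinforcement learning on game outcomes requires a judge that turns a finished episode into a winner. The sketch below is a minimal rule-based judge for Adversarial Taboo. Only the first rule below comes from the description above; the second and third are assumptions, and the repository's actual rules may differ. The judge applies these rules in order for each turn:

1. The defender saying the target word unknowingly is a win for the attacker.
2. The attacker saying the target word outright loses.
3. The defender may make an explicit guess, which wins if correct and loses if wrong.

If no rule fires within the recorded turns, the game is a tie. The guess phrasing `GUESS_PREFIX` and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical phrasing the defender uses to make an explicit guess.
GUESS_PREFIX = "i know the word! it is"


def _mentions(utterance: str, word: str) -> bool:
    """True if `word` appears in `utterance` as a whole word (case-insensitive)."""
    pattern = rf"\b{re.escape(word.lower())}\b"
    return re.search(pattern, utterance.lower()) is not None


def judge(target: str, turns: List[Tuple[str, str]]) -> str:
    """Return 'attacker', 'defender', or 'tie' for a list of (attacker, defender) turns."""
    for attacker_msg, defender_msg in turns:
        if _mentions(attacker_msg, target):
            return "defender"  # assumed rule: the attacker may not say the word
        text = defender_msg.lower().strip()
        if text.startswith(GUESS_PREFIX):
            # Explicit guess: compare it with the target after trimming punctuation.
            guess = text[len(GUESS_PREFIX):].strip(" .!\"'")
            return "defender" if guess == target.lower() else "attacker"
        if _mentions(defender_msg, target):
            return "attacker"  # the defender said the word without guessing
    return "tie"  # no decisive event within the recorded turns


if __name__ == "__main__":
    episode = [("Name a red fruit that keeps doctors away.", "An apple a day!")]
    print(judge("apple", episode))
```

A scalar reward for each player (for example +1 for a win, -1 for a loss, 0 for a tie) can then be derived from the returned winner. That mapping is also an assumption.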