 violation


Sequentially Auditing Differential Privacy

Neural Information Processing Systems

We propose a practical sequential test for auditing differential privacy guarantees of black-box mechanisms. The test processes streams of mechanisms' outputs, providing anytime-valid inference while controlling Type I error and overcoming the fixed sample size limitation of previous batch auditing methods. Experiments show this test detects violations with sample sizes that are orders of magnitude smaller than existing methods, reducing this number from 50K to a few hundred examples, across diverse realistic mechanisms. Notably, it identifies DP-SGD privacy violations in under one training run, unlike prior methods needing full model training.


Enforcing Hard Linear Constraints in Deep Learning Models with Decision Rules

Neural Information Processing Systems

Deep learning models are increasingly deployed in safety-critical tasks where predictions must satisfy hard constraints, such as physical laws, fairness requirements, or safety limits. However, standard architectures lack built-in mechanisms to enforce such constraints, and existing approaches based on regularization or projection are often limited to simple constraints, computationally expensive, or lack feasibility guarantees. This paper proposes a model-agnostic framework for enforcing input-dependent linear equality and inequality constraints on neural network outputs. The architecture combines a task network trained for prediction accuracy with a safe network trained using decision rules from the stochastic and robust optimization literature to ensure feasibility across the entire input space. The final prediction is a convex combination of the two subnetworks, guaranteeing constraint satisfaction during both training and inference without iterative procedures or runtime optimization. We prove that the architecture is a universal approximator of constrained functions and derive computationally tractable formulations based on linear decision rules. Empirical results on benchmark regression tasks show that our method consistently satisfies constraints while maintaining competitive accuracy and low inference latency.
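The convex-combination idea can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the combination weight is chosen, so the rule below (take the largest weight on the task output that keeps every inequality satisfied) is an assumption, as are the function name `blend_to_feasible` and the restriction to inequality constraints of the form `A @ y <= b`. It also assumes the safe output is already feasible.

```python
import numpy as np

def blend_to_feasible(task_out, safe_out, A, b):
    """Blend a task prediction with a feasible 'safe' prediction.

    Returns (y, alpha) where y = alpha * task_out + (1 - alpha) * safe_out
    and alpha is the largest value in [0, 1] such that A @ y <= b.
    Assumption (hypothetical setup): A @ safe_out <= b already holds.
    """
    task_out = np.asarray(task_out, dtype=float)
    safe_out = np.asarray(safe_out, dtype=float)
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)

    slack = b - A @ safe_out            # room left in each constraint at alpha = 0
    if np.any(slack < -1e-9):
        raise ValueError("safe_out must satisfy A @ y <= b")
    slack = np.maximum(slack, 0.0)

    step = A @ (task_out - safe_out)    # change in each constraint as alpha goes 0 -> 1
    alpha = 1.0
    for s, d in zip(slack, step):
        if d > s:                       # this constraint would be violated at alpha = 1
            alpha = min(alpha, s / d)
    y = alpha * task_out + (1.0 - alpha) * safe_out
    return y, alpha

# Example: box constraint y <= 1 in each coordinate, safe point at the origin.
A = np.eye(2)
b = np.array([1.0, 1.0])
y, alpha = blend_to_feasible([2.0, 0.5], [0.0, 0.0], A, b)
```

Because the feasible set of linear inequalities is convex, any point on the segment between the safe output and the blended output stays feasible, which is why a single scalar weight is enough in this sketch.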


T-norm Selection for Object Detection in Autonomous Driving with Logical Constraints

Neural Information Processing Systems

Integrating logical constraints into object detection models for autonomous driving (AD) is a promising way to enhance their compliance with rules and thus increase the safety of the system. In this setting, t-norms have been utilized to calculate the constrained loss, i.e., the violations of logical constraints as losses. While prior works have statically selected a few t-norms, we conduct an extensive experimental study to identify the most effective choices, as suboptimal t-norms can lead to undesired model behavior. For this, we present MOD-ECL, a neurosymbolic framework that implements a wide range of t-norms and can use them in an adaptive manner, with an algorithm that selects well-performing t-norms during training and a scheduler that regulates the impact of the constrained loss. We evaluate its effectiveness on the ROAD-R and ROAD-Waymo-R datasets for object detection in AD with attached common-sense constraints. Our results show that careful selection of parameters is crucial for good behavior of the constrained loss and that our framework allows us to obtain not only lower constraint violation but in some cases also an increase in detection performance. Furthermore, our methods allow fine control over the tradeoff between accuracy and violation.
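To make the role of the t-norm concrete, the sketch below scores the violation of a single mutual-exclusion constraint under the three standard t-norms (Gödel, product, Łukasiewicz). It is an illustration only, not MOD-ECL's code: the constraint, the label names in the comments, and the function `violation` are hypothetical, and the abstract does not say which t-norms the framework implements.

```python
def godel(a, b):
    """Goedel (minimum) t-norm."""
    return min(a, b)

def product(a, b):
    """Product t-norm."""
    return a * b

def lukasiewicz(a, b):
    """Lukasiewicz t-norm."""
    return max(0.0, a + b - 1.0)

def violation(p_a, p_b, tnorm):
    """Violation of the constraint 'not (A and B)'.

    The fuzzy truth of (A and B) under the chosen t-norm is used as the
    violation degree, which could then be added to the training loss.
    """
    return tnorm(p_a, p_b)

# Hypothetical detector confidences for two labels that should not co-occur,
# e.g. 'pedestrian' and 'car' on the same bounding box.
p_a, p_b = 0.7, 0.6
scores = {t.__name__: violation(p_a, p_b, t) for t in (godel, product, lukasiewicz)}
```

The same pair of confidences yields a different penalty under each t-norm (largest for Gödel, smallest for Łukasiewicz), which is one reason the choice of t-norm can change how strongly a constrained loss pushes the model.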


Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Neural Information Processing Systems

Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark--a curated set of high-impact attacks--and evaluate it across 19 state-of-the-art models.


Markov Persuasion Processes: Learning to Persuade From Scratch

Neural Information Processing Systems

In Bayesian persuasion, an informed sender strategically discloses information to a receiver so as to persuade them to undertake desirable actions. Recently, Markov persuasion processes (MPPs) have been introduced to capture sequential scenarios where a sender faces a stream of myopic receivers in a Markovian environment. The MPPs studied so far in the literature suffer from issues that prevent them from being fully operational in practice, e.g., they assume that the sender knows receivers' rewards. We fix such issues by addressing MPPs where the sender has no knowledge about the environment.


Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Neural Information Processing Systems

We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by evidence provided by attack success rate (ASR) comparisons. We show, through conceptual, theoretical, and empirical contributions, that many conclusions are founded on apples-to-oranges comparisons or low-validity measurements. Our arguments are grounded in asking a simple question: When can attack success rates be meaningfully compared? To answer this question, we draw on ideas from social science measurement theory and inferential statistics, which, taken together, provide a conceptual grounding for understanding when numerical values obtained through the quantification of system attributes can be meaningfully compared. Through this lens, we articulate conditions under which ASRs can and cannot be meaningfully compared. Using jailbreaking as a running example, we provide examples and extensive discussion of apples-to-oranges ASR comparisons and measurement validity challenges.


Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation

Neural Information Processing Systems

We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\tilde{O}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.


Constrained Diffusers for Safe Planning and Control

Neural Information Processing Systems

Diffusion models have shown remarkable potential in planning and control tasks due to their ability to represent multimodal distributions over actions and trajectories. However, ensuring safety under constraints remains a critical challenge for diffusion models. This paper proposes Constrained Diffusers, an extended framework for planning and control that incorporates distribution-level constraints into pretrained diffusion models without retraining or architectural modifications. Inspired by constrained optimization, we apply a constrained Langevin sampling method for the reverse diffusion process that jointly optimizes the trajectory and achieves constraint satisfaction through three iterative algorithms: the projected method, the primal-dual method, and the augmented Lagrangian method. In addition, we incorporate discrete control barrier functions as constraints for constrained diffusers to guarantee safety in online implementation, following a receding-horizon control scheme in which we generate a short-horizon plan and execute only the first action before replanning. Experiments in Maze2D, locomotion, and PyBullet ball running tasks demonstrate that our proposed methods achieve constraint satisfaction with less computation time, and are competitive with existing methods in environments with static and time-varying constraints. The implementation can be found here.



Constrained Discrete Diffusion

Neural Information Processing Systems

Discrete diffusion models are a class of generative models that construct sequences by progressively denoising samples from a categorical noise distribution. Beyond their rapidly growing ability to generate coherent natural language, these models present a new and important opportunity to enforce sequence-level constraints, a capability that current autoregressive models cannot natively provide. This paper capitalizes on this opportunity by introducing Constrained Discrete Diffusion (CDD), a novel integration of differentiable constraint optimization within the diffusion process to ensure adherence to constraints, logic rules, or safety requirements for generated sequences. Unlike conventional text generators that often rely on post-hoc filtering or model retraining for controllable generation, CDD directly imposes constraints into the discrete diffusion sampling process, resulting in a training-free and effective approach. Experiments in toxicity-controlled text generation, property-constrained molecule design, and instruction-constrained text completion demonstrate that CDD achieves zero constraint violations in a diverse array of tasks while preserving fluency, novelty, and coherence, and outperforms autoregressive and existing discrete diffusion approaches.