Goto

Collaborating Authors

 adversarial manipulation


Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation

Neural Information Processing Systems

Recent work has shown that state-of-the-art classifiers are quite brittle, in the sense that a small adversarial change of an originally with high confidence correctly classified input leads to a wrong classification again with high confidence. This raises concerns that such classifiers are vulnerable to attacks and calls into question their usage in safety-critical systems. We show in this paper for the first time formal guarantees on the robustness of a classifier by giving instance-specific \emph{lower bounds} on the norm of the input manipulation required to change the classifier decision. Based on this analysis we propose the Cross-Lipschitz regularization functional. We show that using this form of regularization in kernel methods resp.


Adversarial Manipulation of Reasoning Models using Internal Representations

Yamaguchi, Kureha, Etheridge, Benjamin, Arditi, Andy

arXiv.org Artificial Intelligence

Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation.


Reviews: Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation

Neural Information Processing Systems

This paper fills an important gap in the literature of robustness of classifiers to adversarial examples by proposing the first (to the best of my knowledge) formal guarantee (at an example level) on the robustness of a given classifier to adversarial examples. Unsurprisingly, the bound involves the Lipschitz constant of the Jacobians which the authors exploit to propose a cross-Lipschitz regularization. Overall the paper is well written, and the material is well presented. The proof of Theorem 2.1 is correct. I did not check the proofs of the propositions 2.1 and 4.1. This is an interesting work.


Assessing Neural Network Robustness via Adversarial Pivotal Tuning

Christensen, Peter Ebert, Snæbjarnarson, Vésteinn, Dittadi, Andrea, Belongie, Serge, Benaim, Sagie

arXiv.org Artificial Intelligence

The ability to assess the robustness of image classifiers to a diverse set of manipulations is essential to their deployment in the real world. Recently, semantic manipulations of real images have been considered for this purpose, as they may not arise using standard adversarial settings. However, such semantic manipulations are often limited to style, color or attribute changes. While expressive, these manipulations do not consider the full capacity of a pretrained generator to affect adversarial image manipulations. In this work, we aim at leveraging the full capacity of a pretrained image generator to generate highly detailed, diverse and photorealistic image manipulations. Inspired by recent GAN-based image inversion methods, we propose a method called Adversarial Pivotal Tuning (APT). APT first finds a pivot latent space input to a pretrained generator that best reconstructs an input image. It then adjusts the weights of the generator to create small, but semantic, manipulations which fool a pretrained classifier. Crucially, APT changes both the input and the weights of the pretrained generator, while preserving its expressive latent editing capability, thus allowing the use of its full capacity in creating semantic adversarial manipulations. We demonstrate that APT generates a variety of semantic image manipulations, which preserve the input image class, but which fool a variety of pretrained classifiers. We further demonstrate that classifiers trained to be robust to other robustness benchmarks, are not robust to our generated manipulations and propose an approach to improve the robustness towards our generated manipulations. Code available at: https://captaine.github.io/apt/


Artificial Intelligence: Too Fragile to Fight?

#artificialintelligence

You can become utterly dependent on a new glamorous technology, be it cyber-space, artificial intelligence. . . But does it create a potential achilles heel? Artificial intelligence (AI) has become the technical focal point for advancing naval and Department of Defense (DoD) capabilities. Secretary of the Navy Carlos Del Toro listed AI first among his priorities for innovating U.S. naval forces. Chief of Naval Operations Admiral Michael Gilday listed it as his top priority during his Senate confirmation hearing.2


Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation

Hein, Matthias, Andriushchenko, Maksym

Neural Information Processing Systems

Recent work has shown that state-of-the-art classifiers are quite brittle, in the sense that a small adversarial change of an originally with high confidence correctly classified input leads to a wrong classification again with high confidence. This raises concerns that such classifiers are vulnerable to attacks and calls into question their usage in safety-critical systems. We show in this paper for the first time formal guarantees on the robustness of a classifier by giving instance-specific \emph{lower bounds} on the norm of the input manipulation required to change the classifier decision. Based on this analysis we propose the Cross-Lipschitz regularization functional. We show that using this form of regularization in kernel methods resp.


Army scientists train machine learning models to wrangle dirty data

#artificialintelligence

Army researchers have developed a new approach for training machine learning models that can better withstand dirty and deceptive data. Models trained under this method have greatly surpassed other state-of-the-art models in terms of robustness, scientists said. Machines outperform humans in many data-processing tasks, but sometimes fall victim to obvious mistakes that humans can see a mile away. Scientists at the U.S. Army Combat Capabilities Development Command's Army Research Laboratory designed a new approach that makes it harder for adversaries to trick machine learning models. "We were able to reduce model complexity by about a factor of 10 without affecting other performance metrics under benign conditions," said Army scientist Dr. Ananthram Swami.


Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals

Huang, Yunhan, Zhu, Quanyan

arXiv.org Artificial Intelligence

This paper studies reinforcement learning (RL) under malicious falsification on cost signals and introduces a quantitative framework of attack models to understand the vulnerabilities of RL. Focusing on $Q$-learning, we show that $Q$-learning algorithms converge under stealthy attacks and bounded falsifications on cost signals. We characterize the relation between the falsified cost and the $Q$-factors as well as the policy learned by the learning agent which provides fundamental limits for feasible offensive and defensive moves. We propose a robust region in terms of the cost within which the adversary can never achieve the targeted policy. We provide conditions on the falsified cost which can mislead the agent to learn an adversary's favored policy. A numerical case study of water reservoir control is provided to show the potential hazards of RL in learning-based control systems and corroborate the results.


A Fundamental Performance Limitation for Adversarial Classification

Makdah, Abed AlRahman Al, Katewa, Vaibhav, Pasqualetti, Fabio

arXiv.org Machine Learning

Despite the widespread use of machine learning algorithms to solve problems of technological, economic, and social relevance, provable guarantees on the performance of these data-driven algorithms are critically lacking, especially when the data originates from unreliable sources and is transmitted over unprotected and easily accessible channels. In this paper we take an important step to bridge this gap and formally show that, in a quest to optimize their accuracy, binary classification algorithms -- including those based on machine-learning techniques -- inevitably become more sensitive to adversarial manipulation of the data. Further, for a given class of algorithms with the same complexity (i.e., number of classification boundaries), the fundamental tradeoff curve between accuracy and sensitivity depends solely on the statistics of the data, and cannot be improved by tuning the algorithm.


Neural Networks Should Be Wide Enough to Learn Disconnected Decision Regions

Nguyen, Quynh, Mukkamala, Mahesh, Hein, Matthias

arXiv.org Machine Learning

In the recent literature the important role of depth in deep learning has been emphasized. In this paper we argue that sufficient width of a feedforward network is equally important by answering the simple question under which conditions the decision regions of a neural network are connected. It turns out that for a class of activation functions including leaky ReLU, neural networks having a pyramidal structure, that is no layer has more hidden units than the input dimension, produce necessarily connected decision regions. This implies that a sufficiently wide layer is necessary to produce disconnected decision regions. We discuss the implications of this result for the construction of neural networks, in particular the relation to the problem of adversarial manipulation of classifiers.