adversarial manipulation
Adversarial Manipulation of Reasoning Models using Internal Representations
Yamaguchi, Kureha, Etheridge, Benjamin, Arditi, Andy
Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (0.94)
- Health & Medicine > Therapeutic Area > Immunology (0.70)
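As a rough illustration of the directional-ablation idea described in the abstract above, here is a minimal PyTorch sketch; the `caution_direction` vector, the hooked module, and the layer index are hypothetical placeholders, not the authors' implementation (their code is at the linked repository).

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (projection ablation)."""
    d = direction / direction.norm()              # unit-norm "caution" direction
    coeff = hidden @ d                            # projection coefficient per token
    return hidden - coeff.unsqueeze(-1) * d       # subtract the projected component

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that ablates `direction` from a transformer block's output."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = ablate_direction(h, direction.to(h.dtype))
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Hypothetical usage: ablate the direction only while CoT tokens are being generated.
# caution_direction = ...  # e.g. a difference-of-means vector over refusal vs. compliance CoTs
# handle = model.model.layers[layer_idx].register_forward_hook(make_ablation_hook(caution_direction))
```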
Reviews: Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation
This paper fills an important gap in the literature on the robustness of classifiers to adversarial examples by proposing the first (to the best of my knowledge) formal guarantee, at the example level, on the robustness of a given classifier to adversarial examples. Unsurprisingly, the bound involves the Lipschitz constant of the Jacobians, which the authors exploit to propose a cross-Lipschitz regularization. Overall the paper is well written, and the material is well presented. The proof of Theorem 2.1 is correct. I did not check the proofs of Propositions 2.1 and 4.1. This is interesting work.
Assessing Neural Network Robustness via Adversarial Pivotal Tuning
Christensen, Peter Ebert, Snæbjarnarson, Vésteinn, Dittadi, Andrea, Belongie, Serge, Benaim, Sagie
The ability to assess the robustness of image classifiers to a diverse set of manipulations is essential to their deployment in the real world. Recently, semantic manipulations of real images have been considered for this purpose, as they may not arise under standard adversarial settings. However, such semantic manipulations are often limited to style, color, or attribute changes. While expressive, these manipulations do not exploit the full capacity of a pretrained generator to produce adversarial image manipulations. In this work, we aim to leverage the full capacity of a pretrained image generator to generate highly detailed, diverse and photorealistic image manipulations. Inspired by recent GAN-based image inversion methods, we propose a method called Adversarial Pivotal Tuning (APT). APT first finds a pivot latent space input to a pretrained generator that best reconstructs an input image. It then adjusts the weights of the generator to create small, but semantic, manipulations which fool a pretrained classifier. Crucially, APT changes both the input and the weights of the pretrained generator, while preserving its expressive latent editing capability, thus allowing the use of its full capacity in creating semantic adversarial manipulations. We demonstrate that APT generates a variety of semantic image manipulations, which preserve the input image class, but which fool a variety of pretrained classifiers. We further demonstrate that classifiers trained to be robust on other robustness benchmarks are not robust to our generated manipulations, and we propose an approach to improve their robustness to these manipulations. Code available at: https://captaine.github.io/apt/
- North America > United States (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
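Below is a hedged sketch of the two-stage procedure the APT abstract outlines, pivot inversion followed by adversarial tuning of the generator weights; the interfaces of `G` and `classifier`, the loss weights, and the optimizer settings are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_pivotal_tuning(G, classifier, x, y_true,
                               pivot_steps=500, tune_steps=200, lr=1e-3, recon_weight=10.0):
    """Hedged two-stage sketch: (1) invert x to a pivot latent, (2) tune G's weights
    so the generated image fools `classifier` while staying close to x."""
    # Stage 1: pivot inversion. Optimize only the latent; generator frozen.
    for p in G.parameters():
        p.requires_grad_(False)
    w = torch.zeros(1, G.latent_dim, requires_grad=True)       # assumes G exposes latent_dim
    opt_w = torch.optim.Adam([w], lr=lr)
    for _ in range(pivot_steps):
        opt_w.zero_grad()
        F.mse_loss(G(w), x).backward()
        opt_w.step()
    w = w.detach()

    # Stage 2: pivotal tuning. Latent fixed; adjust generator weights so the output
    # fools the classifier while remaining close to the input image.
    for p in G.parameters():
        p.requires_grad_(True)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(tune_steps):
        opt_g.zero_grad()
        x_adv = G(w)
        adv_loss = classifier(x_adv)[:, y_true].mean()          # push the true-class score down
        recon_loss = F.mse_loss(x_adv, x)                       # keep the image close to x
        (adv_loss + recon_weight * recon_loss).backward()
        opt_g.step()
    return G(w).detach()
```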
Artificial Intelligence: Too Fragile to Fight?
You can become utterly dependent on a new glamorous technology, be it cyberspace, artificial intelligence... But does it create a potential Achilles heel? Artificial intelligence (AI) has become the technical focal point for advancing naval and Department of Defense (DoD) capabilities. Secretary of the Navy Carlos Del Toro listed AI first among his priorities for innovating U.S. naval forces. Chief of Naval Operations Admiral Michael Gilday listed it as his top priority during his Senate confirmation hearing.
- North America > United States (1.00)
- Europe > France (0.04)
- Asia > Middle East > Iran > Hormozgan Province > Bandar Abbas (0.04)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)
Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation
Hein, Matthias, Andriushchenko, Maksym
Recent work has shown that state-of-the-art classifiers are quite brittle, in the sense that a small adversarial change to an input that was originally classified correctly with high confidence leads to a wrong classification, again with high confidence. This raises concerns that such classifiers are vulnerable to attacks and calls into question their usage in safety-critical systems. We show in this paper, for the first time, formal guarantees on the robustness of a classifier by giving instance-specific \emph{lower bounds} on the norm of the input manipulation required to change the classifier decision. Based on this analysis we propose the Cross-Lipschitz regularization functional. We show that using this form of regularization in kernel methods and neural networks improves the robustness of the classifier.
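Roughly, and with notation adapted (see the paper for the exact statement and constants), the instance-specific guarantee and the proposed regularizer take the following form:

```latex
% Hedged paraphrase of the paper's main quantities, not a verbatim restatement.
% Guarantee: if c = \arg\max_j f_j(x), the decision at x is unchanged by any
% perturbation \delta with (1/p + 1/q = 1)
\[
  \|\delta\|_p \;\le\; \max_{R>0}\, \min\Big\{\, \min_{j\neq c}
    \frac{f_c(x) - f_j(x)}{\max_{y \in B_p(x,R)} \|\nabla f_c(y) - \nabla f_j(y)\|_q},\; R \Big\}.
\]
% Cross-Lipschitz regularizer: penalize large differences of class-score gradients,
\[
  \Omega(f) \;=\; \frac{1}{n K^2} \sum_{i=1}^{n} \sum_{l,m=1}^{K}
    \big\|\nabla f_l(x_i) - \nabla f_m(x_i)\big\|_2^2 .
\]
```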
Army scientists train machine learning models to wrangle dirty data
Army researchers have developed a new approach for training machine learning models that can better withstand dirty and deceptive data. Models trained under this method have greatly surpassed other state-of-the-art models in terms of robustness, scientists said. Machines outperform humans in many data-processing tasks, but sometimes fall victim to obvious mistakes that humans can see a mile away. Scientists at the U.S. Army Combat Capabilities Development Command's Army Research Laboratory designed a new approach that makes it harder for adversaries to trick machine learning models. "We were able to reduce model complexity by about a factor of 10 without affecting other performance metrics under benign conditions," said Army scientist Dr. Ananthram Swami.
- North America > United States > Maryland > Prince George's County > Adelphi (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
- Government > Military > Army (1.00)
- Government > Regional Government > North America Government > United States Government (0.38)
Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals
This paper studies reinforcement learning (RL) under malicious falsification on cost signals and introduces a quantitative framework of attack models to understand the vulnerabilities of RL. Focusing on $Q$-learning, we show that $Q$-learning algorithms converge under stealthy attacks and bounded falsifications on cost signals. We characterize the relation between the falsified cost and the $Q$-factors as well as the policy learned by the learning agent which provides fundamental limits for feasible offensive and defensive moves. We propose a robust region in terms of the cost within which the adversary can never achieve the targeted policy. We provide conditions on the falsified cost which can mislead the agent to learn an adversary's favored policy. A numerical case study of water reservoir control is provided to show the potential hazards of RL in learning-based control systems and corroborate the results.
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- Energy (0.93)
- Information Technology > Security & Privacy (0.68)
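As a minimal illustration of the setting the abstract describes, here is a tabular Q-learning sketch in which the observed cost passes through a falsification before the update; the `env` and `attack` interfaces are hypothetical, not the paper's code.

```python
import numpy as np

def q_learning_with_falsified_cost(env, attack, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning in a cost-minimization setting where the observed cost is
    passed through a (possibly bounded) falsification `attack` before the update.
    `env` and `attack` are hypothetical interfaces, not the paper's code."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy with respect to cost minimization
            a = np.random.randint(env.n_actions) if np.random.rand() < eps else int(Q[s].argmin())
            s_next, cost, done = env.step(a)
            cost = attack(s, a, cost)                   # adversarially falsified cost signal
            target = cost + gamma * Q[s_next].min()     # Bellman target, min over actions
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```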
A Fundamental Performance Limitation for Adversarial Classification
Makdah, Abed AlRahman Al, Katewa, Vaibhav, Pasqualetti, Fabio
Despite the widespread use of machine learning algorithms to solve problems of technological, economic, and social relevance, provable guarantees on the performance of these data-driven algorithms are critically lacking, especially when the data originates from unreliable sources and is transmitted over unprotected and easily accessible channels. In this paper we take an important step to bridge this gap and formally show that, in a quest to optimize their accuracy, binary classification algorithms -- including those based on machine-learning techniques -- inevitably become more sensitive to adversarial manipulation of the data. Further, for a given class of algorithms with the same complexity (i.e., number of classification boundaries), the fundamental tradeoff curve between accuracy and sensitivity depends solely on the statistics of the data, and cannot be improved by tuning the algorithm.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
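As a toy numerical illustration of the accuracy/sensitivity tension the abstract describes (not taken from the paper), consider a 1-D threshold classifier on two overlapping Gaussians: the accuracy-optimal threshold leaves correctly classified points with the smallest average margin, so smaller perturbations suffice to flip decisions.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, 100_000)   # class 0 samples
x1 = rng.normal(+1.0, 1.0, 100_000)   # class 1 samples

# Sweep the decision threshold t: predict class 0 if x < t, class 1 otherwise.
for t in (0.0, 0.5, 1.0):             # t = 0.0 is the accuracy-optimal (Bayes) threshold here
    acc = 0.5 * ((x0 < t).mean() + (x1 >= t).mean())
    margins = np.concatenate([t - x0[x0 < t], x1[x1 >= t] - t])   # distance of correct points to boundary
    print(f"t={t:+.1f}  accuracy={acc:.3f}  mean margin={margins.mean():.3f}")
```

Running this shows accuracy falling and the mean margin growing as the threshold moves away from its accuracy-optimal position, a simple instance of the tradeoff the paper formalizes.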
Neural Networks Should Be Wide Enough to Learn Disconnected Decision Regions
Nguyen, Quynh, Mukkamala, Mahesh, Hein, Matthias
In the recent literature the important role of depth in deep learning has been emphasized. In this paper we argue that sufficient width of a feedforward network is equally important, by answering the simple question of under which conditions the decision regions of a neural network are connected. It turns out that, for a class of activation functions including leaky ReLU, neural networks having a pyramidal structure, that is, no layer has more hidden units than the input dimension, necessarily produce connected decision regions. This implies that a sufficiently wide layer is necessary to produce disconnected decision regions. We discuss the implications of this result for the construction of neural networks, in particular the relation to the problem of adversarial manipulation of classifiers.
- Europe > Germany > Saarland (0.04)
- North America > United States > Massachusetts > Middlesex County > Reading (0.04)
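Stated loosely, as a paraphrase of the abstract above rather than the paper's exact theorem:

```latex
% Loose restatement adapted from the abstract, not the paper's precise statement:
% for a network f : \mathbb{R}^d \to \mathbb{R}^K whose activations belong to the class
% considered there (including leaky ReLU) and whose hidden layers all have width at most d
% (a "pyramidal" network), each decision region
\[
  C_j \;=\; \{\, x \in \mathbb{R}^d \;:\; f_j(x) > f_k(x) \ \text{for all } k \neq j \,\}
\]
% is connected; producing disconnected decision regions therefore requires at least one
% layer wider than the input dimension d.
```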