AITopics | adversarial perturbation

Collaborating Authors

adversarial perturbation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Fu, Shaopeng, Wang, Di

arXiv.org Machine LearningApr-15-2026

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.

large language model, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

2604.12817

Country: Europe > United Kingdom (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Lower bounds on the robustness to adversarial perturbations

Neural Information Processing SystemsMar-17-2026, 13:03:52 GMT

The input-output mappings learned by state-of-the-art neural networks are significantly discontinuous. It is possible to cause a neural network used for image recognition to misclassify its input by applying very specific, hardly perceptible perturbations to the input, called adversarial perturbations. Many hypotheses have been proposed to explain the existence of these peculiar samples as well as several methods to mitigate them. A proven explanation remains elusive, however. In this work, we take steps towards a formal characterization of adversarial perturbations by deriving lower bounds on the magnitudes of perturbations necessary to change the classification of neural networks. The bounds are experimentally verified on the MNIST and CIFAR-10 data sets.

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.77)

Add feedback

Adversarial vulnerability for any classifier

Neural Information Processing SystemsMar-16-2026, 23:01:02 GMT

Despite achieving impressive performance, state-of-the-art classifiers remain highly vulnerable to small, imperceptible, adversarial perturbations. This vulnerability has proven empirically to be very intricate to address. In this paper, we study the phenomenon of adversarial perturbations under the assumption that the data is generated with a smooth generative model. We derive fundamental upper bounds on the robustness to perturbations of any classification function, and prove the existence of adversarial perturbations that transfer well across different classifiers with small risk. Our analysis of the robustness also provides insights onto key properties of generative models, such as their smoothness and dimensionality of latent space. We conclude with numerical experimental results showing that our bounds provide informative baselines to the maximal achievable robustness on several datasets.

artificial intelligence, name change, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.66)

Add feedback

Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution

Neural Information Processing SystemsMar-16-2026, 19:53:25 GMT

We study adversarial perturbations when the instances are uniformly distributed over {0,1}^n. We study both inherent bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their direct goals; we identify one of them as the one guaranteeing misclassification by pushing the instances to the error region. We then study some classic algorithms for learning monotone conjunctions and compare their adversarial risk and robustness under different definitions by attacking the hypotheses using instances drawn from the uniform distribution. We observe that sometimes these definitions lead to significantly different bounds. Thus, this study advocates for the use of the error-region definition, even though other definitions, in other contexts with context-dependent assumptions, may coincide with the error-region definition. Using the error-region definition of adversarial perturbations, we then study inherent bounds on risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over {0,1}^n. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error 0.01, there always exists an adversarial perturbation that changes O( n) bits of the instances to increase the risk to 0.5, making classifier's decisions meaningless. Furthermore, by also using the central limit theorem we show that when n, at most c n bits of perturbations, for a universal constant c<1.17, suffice for increasing the risk to 0.5, and the same c n bits of perturbations on average suffice to increase the risk to 1, hence bounding the robustness by c n.

artificial intelligence, machine learning, perturbation, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.38)

Add feedback

Adversarial Examples Exist in Two-Layer ReLU Networks for Low Dimensional Linear Subspaces

Neural Information Processing SystemsFeb-18-2026, 23:37:00 GMT

Neural networks are observed to be susceptible to adversarial perturbations [Szegedy et al., 2013],

artificial intelligence, machine learning, perturbation, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

AUnifiedGame-TheoreticInterpretationof AdversarialRobustness

Neural Information Processing SystemsFeb-18-2026, 22:59:55 GMT

This paper provides a unified view to explain different adversarial attacks and defensemethods,i.e.

artificial intelligence, interaction, machine learning, (19 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
Asia > China (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.93)

Add feedback

Pre-trainedAdversarialPerturbations

Neural Information Processing SystemsFeb-18-2026, 21:53:58 GMT

However, much less effort has been devoted to exploring the adversarial robustness of pre-trained models.

machine learning, natural language, pre-trained model, (19 more...)

Neural Information Processing Systems

Country: Asia > China > Beijing > Beijing (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

DAT: Improving Adversarial Robustness via Generative Amplitude Mix-up in Frequency Domain Fengpeng Li1 Kemou Li1 Haiwei Wu

Neural Information Processing SystemsFeb-18-2026, 12:14:48 GMT

Recent studies show that adversarial attacks disproportionately impact the patterns within the phase of the sample's frequency spectrum--typically containing

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: