A Appendix

Neural Information Processing Systems

A.1 TPPE Method We present the pseudo code for TPPE, using the Insertion mode as an example. According to Alg. 1, the search reduces the query time complexity. In our study, we assume the worst-case scenario of applying punctuation-level attacks. A softmax layer is adopted to predict the label of the input text. We further extend TPPE with Paraphrase (TPPEP) to achieve a single-shot attack. The TPPE method is decomposed into two parts: training and searching.
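
Since the referenced pseudo code (Alg. 1) is not reproduced in this snippet, the following is a rough Python sketch of a greedy punctuation-insertion search against a black-box classifier. The `classify` callable, the greedy strategy, and the edit budget are illustrative assumptions, not the paper's exact TPPE procedure.

```python
PUNCT = ",.;:!?"

def greedy_punctuation_insertion(text, true_label, classify, max_edits=5):
    """Greedy black-box search: insert the punctuation mark, at the position,
    that most lowers the victim model's confidence in the true label.
    `classify(text)` must return a (label, confidence) pair."""
    tokens = text.split()
    for _ in range(max_edits):
        best = None  # (confidence, position, mark)
        for pos in range(len(tokens) + 1):
            for mark in PUNCT:
                candidate = " ".join(tokens[:pos] + [mark] + tokens[pos:])
                label, conf = classify(candidate)
                if label != true_label:
                    return candidate              # label flipped: attack succeeded
                if best is None or conf < best[0]:
                    best = (conf, pos, mark)
        _, pos, mark = best
        tokens = tokens[:pos] + [mark] + tokens[pos:]  # commit the best edit
    return None                                   # failed within the edit budget
```

Any text classifier exposing only labels and confidences can be plugged in as `classify`, which is what makes this a worst-case, query-only attack setting.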




Meta Internal Learning: Supplementary material

Bensadoun, Raphael

Neural Information Processing Systems

Next, we would like to prove the opposite direction. All LeakyReLU activations have a slope of 0.02 for negative values, except when we use a classic discriminator for single-image training, for which we use a slope of 0.2. Additionally, the generator's last conv-block activation at each scale is Tanh instead of ReLU. We clip the gradients so that they have a maximal L2 norm of 1 for both the generators and the discriminators. Batch sizes of 16 were used for all experiments involving a dataset of images. At test time, GPU memory usage is significantly reduced and requires 5GB. In this section, we also consider training our method with a "frozen" pretrained ResNet34. If the problem could be learned with a "small enough" depth, our method would benefit even further. As can be seen, our method yields realistic results with any batch size.
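
As a concrete reading of those conventions, here is a minimal PyTorch sketch; the channel counts, learning rate, and stand-in loss are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, final=False):
    # LeakyReLU slope 0.02 for negatives; Tanh replaces ReLU in the last block.
    act = nn.Tanh() if final else nn.LeakyReLU(0.02)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), act)

generator = nn.Sequential(conv_block(3, 32), conv_block(32, 32),
                          conv_block(32, 3, final=True))
opt = torch.optim.Adam(generator.parameters(), lr=5e-4)  # placeholder optimizer

x = torch.randn(16, 3, 64, 64)      # batch size 16, as in the dataset experiments
loss = generator(x).pow(2).mean()   # stand-in loss, for illustration only
loss.backward()
# Clip gradients to a maximal L2 norm of 1 before the optimizer step.
torch.nn.utils.clip_grad_norm_(generator.parameters(), max_norm=1.0)
opt.step()
```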



When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yan, Yuping, Xie, Yuhan, Zhang, Yixin, Lyu, Lingjuan, Wang, Handing, Jin, Yaochu

arXiv.org Artificial Intelligence

Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
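
As a rough illustration of the second attack level (visual perturbations), the sketch below applies an L-infinity-bounded noise distortion and a fixed-location patch to a placeholder camera frame; the budgets and placement are assumptions, not VLA-Fool's exact settings.

```python
import numpy as np

def noise_perturbation(image, eps=8 / 255, seed=0):
    """Uniform noise bounded in L-infinity norm by eps (budget is illustrative)."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-eps, eps, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def patch_perturbation(image, patch, top=0, left=0):
    """Paste an adversarial patch at a fixed location in the frame."""
    out = image.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out

obs = np.zeros((224, 224, 3))   # placeholder camera frame with values in [0, 1]
attacked = patch_perturbation(noise_perturbation(obs), np.ones((32, 32, 3)))
```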


Unveiling the Latent Directions of Reflection in Large Language Models

Chang, Fu-Chieh, Lee, Yu-Ting, Wu, Pei-Yuan

arXiv.org Artificial Intelligence

Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet, most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize how instructions with different reflective intentions (no reflection, intrinsic reflection, and triggered reflection) are represented in model activations. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward mechanistic understanding of reflective reasoning in LLMs.
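
A minimal sketch of the steering-vector idea, using a toy linear layer as a stand-in for a transformer block (the random "activations" below replace hidden states that would be collected from the real model on contrastive prompt sets):

```python
import torch
import torch.nn as nn

hidden = 16
layer = nn.Linear(hidden, hidden)   # stands in for one transformer block

# Mean hidden states collected under two reflective intentions (random here;
# in practice, activations gathered from the actual model).
acts_triggered = torch.randn(100, hidden).mean(dim=0)
acts_none = torch.randn(100, hidden).mean(dim=0)
steer = acts_triggered - acts_none  # steering vector between reflection levels

alpha = 1.0                         # > 0 enhances reflection, < 0 suppresses it
def steering_hook(module, inputs, output):
    return output + alpha * steer   # shift activations along the latent direction

handle = layer.register_forward_hook(steering_hook)
steered = layer(torch.randn(1, hidden))   # this forward pass is now steered
handle.remove()
```

The asymmetry reported in the abstract corresponds to negative alpha (suppression) flipping behavior at smaller magnitudes than positive alpha (stimulation) requires.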


VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

Hu, Qianyue, Wu, Junyan, Lu, Wei, Luo, Xiangyang

arXiv.org Artificial Intelligence

Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, VoiceCloak obfuscates speaker identity by distorting representation-learning embeddings to maximize identity variation, guided by auditory perception principles. It also disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that is essential for convincing cloning. To address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.
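
The identity-obfuscation objective can be sketched as signed-gradient steps that push the perturbed reference away from the clean speaker embedding; the stand-in encoder, step sizes, and budget below are illustrative assumptions rather than VoiceCloak's actual procedure.

```python
import torch
import torch.nn.functional as F

speaker_encoder = torch.nn.Linear(16000, 128)   # stand-in for a real embedder

def obfuscate_identity(ref_audio, eps=0.005, steps=50, lr=1e-3):
    """Minimize cosine similarity between perturbed and original speaker
    embeddings, keeping the perturbation inside an L-infinity budget."""
    clean_emb = speaker_encoder(ref_audio).detach()
    delta = torch.zeros_like(ref_audio, requires_grad=True)
    for _ in range(steps):
        emb = speaker_encoder(ref_audio + delta)
        sim = F.cosine_similarity(emb, clean_emb, dim=-1).mean()
        sim.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()   # push embedding away from identity
            delta.clamp_(-eps, eps)           # keep the perturbation small
            delta.grad.zero_()
    return (ref_audio + delta).detach()

protected = obfuscate_identity(torch.randn(1, 16000))
```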


TRAP: Targeted Redirecting of Agentic Preferences

Kang, Hangoo, Yeon, Jehyeok, Singh, Gagandeep

arXiv.org Artificial Intelligence

Autonomous agentic AI systems powered by vision-language models (VLMs) are rapidly advancing toward real-world deployment, yet their cross-modal reasoning capabilities introduce new attack surfaces for adversarial manipulation that exploit semantic reasoning across modalities. Existing adversarial attacks typically rely on visible pixel perturbations or require privileged model or environment access, making them impractical for stealthy, real-world exploitation. We introduce TRAP, a novel generative adversarial framework that manipulates the agent's decision-making using diffusion-based semantic injections into the vision-language embedding space. Our method combines negative prompt-based degradation with positive semantic optimization, guided by a Siamese semantic network and layout-aware spatial masking. Without requiring access to model internals, TRAP produces visually natural images yet induces consistent selection biases in agentic AI systems. We evaluate TRAP on the Microsoft Common Objects in Context (COCO) dataset, building multi-candidate decision scenarios. Across these scenarios, TRAP consistently induces decision-level preference redirection on leading models, including LLaVA-34B, Gemma3, GPT-4o, and Mistral-3.2, significantly outperforming existing baselines such as SPSA, Bandit, and standard diffusion approaches. These findings expose a critical, generalized vulnerability: autonomous agents can be consistently misled through visually subtle, semantically guided cross-modal manipulations. Overall, our results show the need for defense strategies beyond pixel-level robustness to address semantic vulnerabilities in cross-modal decision-making. The code for TRAP is accessible on GitHub at https://github.com/uiuc-focal-lab/TRAP.
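
To make the positive/negative semantic objective concrete, here is a hedged sketch of a guidance loss over a shared (Siamese-style) embedding space; the stand-in encoder and the hinge margin are assumptions, not TRAP's actual network or loss.

```python
import torch
import torch.nn.functional as F

embed = torch.nn.Linear(512, 128)   # stand-in for a shared (Siamese) encoder

def semantic_guidance_loss(candidate, positive, negative, margin=0.2):
    """Pull the candidate toward the positive semantics and penalize
    closeness to the negative-prompt semantics in a shared space."""
    c, p, n = (F.normalize(embed(x), dim=-1)
               for x in (candidate, positive, negative))
    attract = 1.0 - (c * p).sum(dim=-1)            # cosine distance to positive
    repel = F.relu((c * n).sum(dim=-1) - margin)   # hinge on negative similarity
    return (attract + repel).mean()

loss = semantic_guidance_loss(torch.randn(4, 512), torch.randn(4, 512),
                              torch.randn(4, 512))
```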


A Formal description of our method

Neural Information Processing Systems

In this section we provide extended experimental results that show the student's test accuracy over the training trajectory. The student's test accuracy over the training trajectory using hard distillation corresponds to the experiments of Figure 4 (see Section 3.1.2) and Figure 8 (see Section 3.1.4). Temperature scaling is a technique introduced in the original distillation paper of Hinton et al. The results can be found in the table below.
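
For reference, temperature scaling in distillation softens both the teacher's and the student's logits before the matching term, as in Hinton et al.; a standard formulation (not specific to this paper's experiments) is:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Hinton-style soft distillation: soften both distributions with
    temperature T, then match them with a KL term scaled by T^2."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10))
```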