Fernandes, Earlence
Computing Optimization-Based Prompt Injections Against Closed-Weights Models By Misusing a Fine-Tuning API
Labunets, Andrey, Pandya, Nishit V., Hooda, Ashish, Fu, Xiaohan, Fernandes, Earlence
We surface a new threat to closed-weights Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
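The core of this attack is a greedy discrete search driven by a scalar loss signal. The sketch below illustrates that pattern under assumptions: loss_oracle is a hypothetical stand-in for the loss-like values returned by a remote fine-tuning API, and the vocabulary, suffix length, and candidate-sampling strategy are illustrative rather than the paper's actual procedure.

```python
import random

def greedy_prompt_search(loss_oracle, vocab, suffix_len=10, iters=200, candidates=16, seed=0):
    """Greedy coordinate search for an adversarial suffix.

    loss_oracle: callable mapping a list of tokens to a scalar loss
                 (lower = closer to the attacker's target behavior).
    vocab:       list of candidate tokens to substitute.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best_loss = loss_oracle(suffix)

    for _ in range(iters):
        pos = rng.randrange(suffix_len)                       # pick one position to mutate
        trials = rng.sample(vocab, k=min(candidates, len(vocab)))
        for tok in trials:                                    # try a handful of substitutions
            cand = suffix[:pos] + [tok] + suffix[pos + 1:]
            loss = loss_oracle(cand)                          # one remote query per candidate
            if loss < best_loss:                              # keep a substitution only if loss improves
                suffix, best_loss = cand, loss
    return suffix, best_loss

if __name__ == "__main__":
    # Toy stand-in oracle: prefers suffixes containing the token "override".
    vocab = ["please", "ignore", "override", "system", "tool", "now", "##", "::"]
    oracle = lambda toks: 1.0 - toks.count("override") / len(toks)
    suffix, loss = greedy_prompt_search(oracle, vocab)
    print(" ".join(suffix), loss)
```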
Misusing Tools in Large Language Models With Visual Adversarial Examples
Fu, Xiaohan, Wang, Zihan, Li, Shuheng, Gupta, Rajesh K., Mireshghallah, Niloofar, Berg-Kirkpatrick, Taylor, Fernandes, Earlence
Large Language Models (LLMs) are being enhanced with the ability to use tools and to process multiple modalities. These new capabilities bring new benefits and also new security risks. In this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations and book hotels. Different from prior work, our attacks can affect the confidentiality and integrity of user resources connected to the LLM while being stealthy and generalizable to multiple input prompts. We construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions. We find that our adversarial images can manipulate the LLM to invoke tools following real-world syntax almost always (~98%) while maintaining high similarity to clean images (~0.9 SSIM). Furthermore, using human scoring and automated metrics, we find that the attacks do not noticeably affect the conversation (and its semantics) between the user and the LLM.
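As a rough illustration of the gradient-based construction described above, the following PGD-style sketch optimizes a bounded image perturbation so that a multimodal model assigns low loss to a target tool-invocation string. The model(image, labels=...) and tokenizer interfaces are assumed placeholders, not the victim model's real API, and the dummy objects in the demo exist only to make the sketch run end to end.

```python
import torch

def craft_tool_misuse_image(model, tokenizer, clean_image, target_call,
                            steps=300, lr=1e-2, eps=8 / 255):
    """Optimize a bounded image perturbation so the victim model assigns low
    language-modeling loss to `target_call` (e.g. a calendar-deletion tool call).

    model(image, labels=ids) is assumed to return that scalar loss; the tokenizer
    is assumed to follow the Hugging Face convention. Both are placeholders.
    """
    target_ids = tokenizer(target_call, return_tensors="pt").input_ids
    delta = torch.zeros_like(clean_image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        adv = (clean_image + delta).clamp(0, 1)        # keep pixel values valid
        loss = model(adv, labels=target_ids)           # loss of emitting the target tool call
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                          # L-infinity bound keeps the image near the clean one
            delta.clamp_(-eps, eps)
    return (clean_image + delta).clamp(0, 1).detach()

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs; they carry no real semantics.
    class DummyTokenizer:
        def __call__(self, text, return_tensors=None):
            return type("Enc", (), {"input_ids": torch.tensor([[0]])})()
    dummy_model = lambda img, labels: (img.mean() - 0.7) ** 2
    clean = torch.rand(1, 3, 32, 32)
    adv = craft_tool_misuse_image(dummy_model, DummyTokenizer(), clean,
                                  "calendar.delete(event_id=42)", steps=50)
    print(float((adv - clean).abs().max()))            # stays within the eps budget
```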
SkillFence: A Systems Approach to Practically Mitigating Voice-Based Confusion Attacks
Hooda, Ashish, Wallace, Matthew, Jhunjhunwalla, Kushal, Fernandes, Earlence, Fawaz, Kassem
Voice assistants are deployed widely and provide useful functionality. However, recent work has shown that commercial systems like Amazon Alexa and Google Home are vulnerable to voice-based confusion attacks that exploit design issues. We propose a systems-oriented defense against this class of attacks and demonstrate its functionality for Amazon Alexa. We ensure that only the skills a user intends to use execute in response to voice commands. Our key insight is that we can interpret a user's intentions by analyzing their activity on counterpart systems of the web and smartphones. For example, the Lyft ride-sharing Alexa skill has an Android app and a website. Our work shows how information from counterpart apps can help reduce ambiguity in the skill invocation process. We build SkillFence, a browser extension that existing voice assistant users can install to ensure that only legitimate skills run in response to their commands. Using real user data from MTurk (N = 116) and experimental trials involving synthetic and organic speech, we show that SkillFence provides a balance between usability and security by securing 90.83% of the skills that a user will need with a false acceptance rate of 19.83%.
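A minimal sketch of the counterpart-matching idea, assuming an illustrative skill-catalog format and using string similarity as a crude stand-in for SkillFence's actual invocation-name analysis; the domains and package names are examples, not the system's real data sources.

```python
from difflib import SequenceMatcher

def resolve_skill(spoken_name, skill_catalog, counterpart_activity, threshold=0.6):
    """Pick the skill a user most likely intends.

    skill_catalog:        {skill_id: {"name": str, "counterparts": set of domains/app packages}}
    counterpart_activity: domains / app packages observed in the user's web and
                          smartphone activity (the "counterpart systems").
    """
    scored = []
    for skill_id, info in skill_catalog.items():
        name_sim = SequenceMatcher(None, spoken_name.lower(), info["name"].lower()).ratio()
        has_counterpart = bool(info["counterparts"] & counterpart_activity)
        scored.append((name_sim, has_counterpart, skill_id))

    # Among phonetically plausible candidates, prefer skills backed by counterpart activity.
    plausible = [s for s in scored if s[0] >= threshold]
    if not plausible:
        return None
    plausible.sort(key=lambda s: (s[1], s[0]), reverse=True)
    return plausible[0][2]

if __name__ == "__main__":
    catalog = {
        "lyft":      {"name": "Lyft",      "counterparts": {"lyft.com", "me.lyft.android"}},
        "lift-quiz": {"name": "Lift Quiz", "counterparts": {"liftquiz.example"}},
    }
    activity = {"lyft.com", "gmail.com"}
    print(resolve_skill("lift", catalog, activity))   # -> "lyft"
```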
Exploring Adversarial Robustness of Deep Metric Learning
Panum, Thomas Kobber, Wang, Zi, Kan, Pengyu, Fernandes, Earlence, Jha, Somesh
Deep Metric Learning (DML), a widely-used technique, involves learning a distance metric between pairs of samples. DML uses deep neural architectures to learn semantic embeddings of the input, where the distance between similar examples is small while dissimilar ones are far apart. Although the underlying neural networks produce good accuracy on naturally occurring samples, they are vulnerable to adversarially-perturbed samples that reduce performance. We take a first step towards training robust DML models and tackle the primary challenge of the metric losses being dependent on the samples in a mini-batch, unlike standard losses that only depend on the specific input-output pair.
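A toy sketch of the mini-batch coupling issue and one robust-optimization response, assuming a simple triplet-style metric loss and PGD inner maximization; this illustrates the general technique, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def batch_triplet_loss(emb, labels, margin=0.2):
    """Metric loss over a whole mini-batch: pull same-class embeddings together,
    push different-class embeddings apart. Unlike a per-example loss, it couples
    every sample in the batch, which is the challenge noted above."""
    d = torch.cdist(emb, emb)                                   # pairwise embedding distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    pos = d[same & off_diag]                                    # within-class distances
    neg = d[~same]                                              # cross-class distances
    return F.relu(pos.mean() - neg.mean() + margin)

def adversarial_dml_step(model, x, labels, opt, eps=0.03, alpha=0.01, pgd_steps=5):
    """One robust-training step: PGD maximizes the batch metric loss, then the
    embedding network is updated on the perturbed batch."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(pgd_steps):                                  # inner maximization
        loss = batch_triplet_loss(model(x + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    opt.zero_grad()
    batch_triplet_loss(model(x + delta), labels).backward()     # outer minimization
    opt.step()

if __name__ == "__main__":
    embed_net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
    opt = torch.optim.SGD(embed_net.parameters(), lr=0.1)
    x = torch.rand(8, 3, 32, 32)
    labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    adversarial_dml_step(embed_net, x, labels, opt)
    print("one robust DML training step completed")
```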
Sequential Attacks on Kalman Filter-based Forward Collision Warning Systems
Ma, Yuzhe, Sharp, Jon, Wang, Ruizhe, Fernandes, Earlence, Zhu, Xiaojin
Kalman Filter (KF) is widely used in various domains to perform sequential learning or variable estimation. In the context of autonomous vehicles, KF constitutes the core component of many Advanced Driver Assistance Systems (ADAS), such as Forward Collision Warning (FCW). It tracks the states (distance, velocity, etc.) of relevant traffic objects based on sensor measurements. The tracking output of KF is often fed into downstream logic to produce alerts, which will then be used by human drivers to make driving decisions in near-collision scenarios. In this paper, we study adversarial attacks on KF as part of the more complex machine-human hybrid system of Forward Collision Warning. Our attack goal is to negatively affect human braking decisions by causing KF to output incorrect state estimations that lead to false or delayed alerts. We accomplish this by sequentially manipulating measurements fed into the KF, and propose a novel Model Predictive Control (MPC) approach to compute the optimal manipulation. Via experiments conducted in a simulated driving environment, we show that the attacker is able to successfully change FCW alert signals through planned manipulation of measurements prior to the desired target time. These results demonstrate that our attack can stealthily mislead a distracted human driver and cause vehicle collisions.
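To make the attack setting concrete, the toy sketch below runs a constant-velocity Kalman filter on distance measurements and searches for bounded measurement perturbations that inflate the final distance estimate (delaying an alert). A generic bounded optimizer stands in for the paper's MPC formulation, and all filter parameters are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Constant-velocity KF tracking [distance, velocity] of a lead vehicle.
DT = 0.1
A = np.array([[1.0, DT], [0.0, 1.0]])      # state transition
H = np.array([[1.0, 0.0]])                 # we measure distance only
Q = 0.01 * np.eye(2)                       # process noise covariance
R = np.array([[0.5]])                      # measurement noise covariance

def kf_run(z_seq, x0=np.array([30.0, -2.0]), P0=np.eye(2)):
    """Run the KF over a measurement sequence; return the final distance estimate."""
    x, P = x0.copy(), P0.copy()
    for z in z_seq:
        x, P = A @ x, A @ P @ A.T + Q                      # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain
        x = x + K @ (np.array([z]) - H @ x)                # update with measurement z
        P = (np.eye(2) - K @ H) @ P
    return x[0]

def attack(z_clean, budget=1.0):
    """Find bounded measurement perturbations that inflate the final distance estimate."""
    obj = lambda d: -kf_run(z_clean + d)                   # maximize estimated distance
    res = minimize(obj, np.zeros_like(z_clean),
                   bounds=[(-budget, budget)] * len(z_clean))
    return res.x

if __name__ == "__main__":
    t = np.arange(20)
    z_clean = 30.0 - 2.0 * DT * t                          # lead vehicle closing in
    d = attack(z_clean)
    print(kf_run(z_clean), kf_run(z_clean + d))            # clean vs. attacked estimate
```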
Analyzing the Interpretability Robustness of Self-Explaining Models
Zheng, Haizhong, Fernandes, Earlence, Prakash, Atul
Recently, interpretable models called self-explaining models (SEMs) have been proposed with the goal of providing interpretability robustness. We evaluate the interpretability robustness of SEMs and show that explanations provided by SEMs as currently proposed are not robust to adversarial inputs. Specifically, we successfully created adversarial inputs that do not change the model outputs but cause significant changes in the explanations. We find that even though current SEMs use stable coefficients for mapping explanations to output labels, they do not consider the robustness of the first stage of the model that creates interpretable basis concepts from the input, leading to non-robust explanations. Our work makes a case for future work to examine how to generate interpretable basis concepts in a robust way.
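The sketch below illustrates the kind of attack described: perturb the input to drift the explanation while penalizing any change in the prediction. The (logits, explanation) interface and the toy self-explaining model are assumptions made for illustration, not the models evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def explanation_attack(sem, x, steps=300, lr=0.01, eps=0.05, lam=10.0):
    """Search for a small input perturbation that shifts the model's explanation
    while keeping its predicted distribution (and hence its output) unchanged.

    sem(x) is assumed to return (logits, explanation); this interface is a
    placeholder for a SENN-style self-explaining model."""
    with torch.no_grad():
        logits0, expl0 = sem(x)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits, expl = sem(x + delta)
        drift = F.mse_loss(expl, expl0)                          # how far the explanation moved
        stay = F.kl_div(F.log_softmax(logits, dim=-1),           # how far the prediction moved
                        F.softmax(logits0, dim=-1), reduction="batchmean")
        loss = -drift + lam * stay                               # maximize drift, keep the prediction fixed
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)
    return (x + delta).detach()

if __name__ == "__main__":
    # Toy SEM: linear concept encoder plus fixed concept-to-label coefficients.
    enc = torch.nn.Linear(10, 4)
    coef = torch.randn(4, 3)
    def toy_sem(inp):
        concepts = enc(inp)                                      # interpretable basis concepts
        return concepts @ coef, concepts
    x = torch.randn(1, 10)
    x_adv = explanation_attack(toy_sem, x)
    print("explanation shift:", (toy_sem(x)[1] - toy_sem(x_adv)[1]).abs().max().item())
    print("prediction:", toy_sem(x)[0].argmax().item(), "->", toy_sem(x_adv)[0].argmax().item())
```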