Sanyal, Amartya
Differentially Private Steering for Large Language Model Alignment
Goel, Anmol, Hu, Yaxi, Gurevych, Iryna, Sanyal, Amartya
Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those samples. In this work, we present the first study of aligning LLM behavior with private datasets. We propose the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy of LLM steering via activation editing. Our attack is tailored to activation editing and relies solely on the generated texts, without their associated probabilities. Our experiments support the theoretical guarantees, showing improved empirical privacy for our PSA algorithm compared to several existing non-private techniques.

LLMs often generate inaccurate, biased or even harmful information that violates human values and preferences (Rawte et al., 2023). In response, recent research has increasingly focused on aligning LLMs towards certain desired behaviors (Konen et al., 2024) while preventing potentially harmful and unsafe outcomes. This has led to the development of several techniques for aligning LLMs, such as Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), instruction tuning (Wei et al., 2022), In-Context Learning (ICL) (Dong et al., 2022), and prompt engineering (Cheng et al., 2024). Nevertheless, several challenges remain, including the lack of diverse and representative datasets for alignment (Liu et al., 2024c), difficulties in addressing out-of-distribution issues (Liu et al., 2024a), the choice of alignment strategy (Ivison et al., 2024), and the lack of interpretability in traditional alignment methods (Lee et al., 2024). The linear representation hypothesis (Park et al., 2024b) suggests that high-level concepts are linearly represented as directions in the representation space of LLMs.
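The abstract does not spell out the mechanism, so the following is only a minimal sketch of the general recipe it alludes to: estimate a steering direction as the mean difference of hidden activations between positive and negative demonstrations, and privatise that mean with per-pair clipping plus the Gaussian mechanism. The function names, the pairing of demonstrations, and the noise calibration are assumptions for illustration, not the paper's PSA algorithm.

import numpy as np

def dp_steering_vector(pos_acts, neg_acts, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Illustrative DP estimate of a steering direction.

    pos_acts, neg_acts: (n, d) hidden activations of paired positive/negative
    demonstrations at a chosen layer. Each per-pair difference is clipped to
    bound its contribution, then Gaussian noise calibrated to clip_norm is
    added to the sum before averaging.
    """
    rng = np.random.default_rng() if rng is None else rng
    diffs = pos_acts - neg_acts                                   # one difference per demo pair
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    diffs = diffs * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy_sum = diffs.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=diffs.shape[1])
    return noisy_sum / len(diffs)                                 # noisy mean = steering vector

def steer(hidden, steering_vector, alpha=1.0):
    # Add the privatised direction to the hidden activation at inference time.
    return hidden + alpha * steering_vector

Because each demonstration pair changes the clipped sum by at most clip_norm in L2 norm, the Gaussian noise yields a standard (epsilon, delta)-DP guarantee for the released direction, with the exact parameters determined by the noise multiplier.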
Open Problems in Machine Unlearning for AI Safety
Barez, Fazl, Fu, Tingchen, Prabhu, Ameya, Casper, Stephen, Sanyal, Amartya, Bibi, Adel, O'Gara, Aidan, Kirk, Robert, Bucknall, Ben, Fist, Tim, Ong, Luke, Torr, Philip, Lam, Kwok-Yan, Trager, Robert, Krueger, David, Mindermann, Sören, Hernandez-Orallo, José, Geva, Mor, Gal, Yarin
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks, which has been the primary focus of existing research. More recently, its potential application to AI safety has gained attention. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear (CBRN) safety. In these contexts, information can be both beneficial and harmful, and models may combine seemingly harmless information for harmful purposes -- unlearning this information could strongly affect beneficial uses. We provide an overview of inherent constraints and open problems, including the broader side effects of unlearning dangerous knowledge, as well as previously unexplored tensions between unlearning and existing safety mechanisms. Finally, we investigate challenges related to evaluation, robustness, and the preservation of safety features during unlearning. By mapping these limitations and open challenges, we aim to guide future research toward realistic applications of unlearning within a broader AI safety framework, acknowledging its limitations and highlighting areas where alternative approaches may be required.
Provable unlearning in topic modeling and downstream tasks
Wei, Stanley, Malladi, Sadhika, Arora, Sanjeev, Sanyal, Amartya
Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model -- i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.
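The claim that unlearning cost can be independent of the original dataset size is easiest to see for models defined by additive sufficient statistics; the toy sketch below illustrates that principle only and is not the paper's provable algorithm for topic models.

import numpy as np

class BowCooccurrence:
    """Toy bag-of-words model that keeps only additive word co-occurrence statistics."""

    def __init__(self, vocab_size):
        self.counts = np.zeros((vocab_size, vocab_size))
        self.n_docs = 0

    def _doc_stats(self, bow):               # bow: length-V vector of word counts
        return np.outer(bow, bow).astype(float)

    def add(self, bow):
        self.counts += self._doc_stats(bow)
        self.n_docs += 1

    def unlearn(self, bow):
        # Exactly removes the document's contribution in O(V^2) time,
        # independent of how many other documents were used for training.
        self.counts -= self._doc_stats(bow)
        self.n_docs -= 1

    def cooccurrence(self):
        return self.counts / max(self.n_docs, 1)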
Delta-Influence: Unlearning Poisons via Influence Functions
Li, Wenjie, Li, Jiawei, de Witt, Christian Schroeder, Prabhu, Ameya, Sanyal, Amartya
Addressing data integrity challenges, such as unlearning the effects of data poisoning after model training, is necessary for the reliable deployment of machine learning models. State-of-the-art influence functions, such as EK-FAC, often fail to accurately attribute abnormal model behavior to the specific poisoned training data responsible for a data poisoning attack. In addition, traditional unlearning algorithms often struggle to effectively remove the influence of poisoned samples, particularly when only a few affected examples can be identified. To address these challenges, we introduce $\Delta$-Influence, a novel approach that leverages influence functions to trace abnormal model behavior back to the responsible poisoned training data using as few as one poisoned test example. $\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points without significantly affecting clean data. This allows $\Delta$-Influence to detect large negative shifts in influence scores following data transformations, a phenomenon we term influence collapse, thereby accurately identifying poisoned training data. Unlearning this subset, e.g. through retraining, effectively eliminates the data poisoning. We validate our method across three vision-based poisoning attacks and three datasets, benchmarking against four detection algorithms and five unlearning strategies. We show that $\Delta$-Influence consistently achieves the best unlearning across all settings, demonstrating the promise of influence functions for corrective unlearning. Our code is publicly available at: \url{https://github.com/andyisokay/delta-influence}
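A hedged sketch of the influence-collapse test described above, assuming access to some influence-scoring routine: the influence callable below is a placeholder (e.g. an EK-FAC approximation), and the threshold and names are hypothetical rather than taken from the paper.

import numpy as np

def detect_influence_collapse(influence, train_data, test_point, transform, drop_threshold=0.9):
    """Flag training points whose influence on a suspicious test point collapses
    after a transformation intended to sever the poison trigger.

    influence(train_example, test_point) -> float   (placeholder scoring function)
    transform(test_point) -> transformed test point
    """
    before = np.array([influence(x, test_point) for x in train_data])
    after = np.array([influence(x, transform(test_point)) for x in train_data])
    drop = before - after
    # Keep points whose score drops by a large fraction of its original (positive) value.
    return [i for i, (b, d) in enumerate(zip(before, drop))
            if b > 0 and d >= drop_threshold * b]

The returned indices are candidates for the poisoned subset, which can then be removed by retraining or another unlearning strategy.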
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Jain, Samyak, Lubana, Ekdeep Singh, Oksuz, Kemal, Joy, Tom, Torr, Philip H. S., Sanyal, Amartya, Dokania, Puneet K.
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., "design") and the specific concepts the task is to be performed upon (e.g., a "cycle" vs. a "bomb"). Using this, we investigate three well-known safety fine-tuning methods -- supervised safety fine-tuning, direct preference optimization, and unlearning -- and provide significant evidence that these methods minimally transform MLP weights so as to align unsafe inputs with the weights' null space. This yields a clustering of inputs based on whether the model deems them safe or not. Correspondingly, when an adversarial input (e.g., a jailbreak) is provided, its activations lie closer to those of safe samples, leading the model to process such an input as if it were safe. We validate our findings, wherever possible, on real-world models -- specifically, Llama-2 7B and Llama-3 8B.
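One concrete way to probe the null-space picture is to measure how much of an input's MLP activation lies in the (approximate) null space of a weight matrix of interest, e.g. the fine-tuned weights or the fine-tuning update; the SVD-based probe below is an illustrative assumption, not the paper's analysis pipeline.

import numpy as np

def null_space_fraction(weight, activation, tol=1e-6):
    """Fraction of the activation's norm lying in the approximate null space of `weight`.

    A value near 1 means the matrix maps this input almost to zero -- the
    abstract's picture of how unsafe inputs are aligned with the null space.
    """
    _, s, vt = np.linalg.svd(weight, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    null_basis = vt[rank:]                      # rows spanning the approximate null space
    coords = null_basis @ activation
    return np.linalg.norm(coords) / (np.linalg.norm(activation) + 1e-12)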
Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation
Sanyal, Amartya, Hu, Yaxi, Yu, Yaodong, Ma, Yian, Wang, Yixin, Schölkopf, Bernhard
"Accuracy-on-the-line" is a widely observed phenomenon in machine learning, where a model's accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to "Accuracy-on-the-wrong-line". This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on Out-of-distribution (OOD) error in a linear classification model, characterizing the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
The Role of Learning Algorithms in Collective Action
Ben-Dov, Omri, Fawkes, Jake, Samadi, Samira, Sanyal, Amartya
Collective action in machine learning is the study of the control that a coordinated group can exert over machine learning algorithms. While previous research has concentrated on assessing the impact of collectives against Bayes (sub-)optimal classifiers, this perspective is limited in that it does not account for the choice of learning algorithm. Since classifiers seldom behave like Bayes classifiers and are influenced by the choice of learning algorithm along with its inherent biases, in this work we initiate the study of how the choice of learning algorithm affects the success of a collective in practical settings. Specifically, we focus on distributionally robust optimization (DRO), popular for improving worst-group error, and on the ubiquitous stochastic gradient descent (SGD), due to its inductive bias for "simpler" functions. Our empirical results, supported by a theoretical foundation, show that the effective size and success of the collective are highly dependent on properties of the learning algorithm. This highlights the necessity of taking the learning algorithm into account when studying the impact of collective action in machine learning.
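To fix ideas, an illustrative version of the signal-planting setup that collective-action analyses build on (trigger construction, learner, and success metric are simplifications chosen here; the paper's experiments study SGD and DRO learners rather than the stand-in below):

import numpy as np
from sklearn.linear_model import LogisticRegression

def plant_signal(X, y, alpha, target_label, trigger_value=5.0, rng=None):
    """An alpha fraction of participants set the last feature to a distinctive
    trigger value and relabel their points to the target class."""
    rng = np.random.default_rng() if rng is None else rng
    X, y = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(alpha * len(X)), replace=False)
    X[idx, -1] = trigger_value
    y[idx] = target_label
    return X, y

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] > 0).astype(int)                        # the underlying task uses feature 0 only
for alpha in [0.005, 0.02, 0.1]:
    Xc, yc = plant_signal(X, y, alpha, target_label=1, rng=rng)
    clf = LogisticRegression(max_iter=1000).fit(Xc, yc)   # stand-in learner
    Xtest = rng.normal(size=(1000, 10))
    Xtest[:, -1] = 5.0                                    # triggered test inputs
    print(alpha, round(float((clf.predict(Xtest) == 1).mean()), 3))

The success rate on triggered inputs is the quantity whose dependence on the learning algorithm the paper investigates.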
Provable Privacy with Non-Private Pre-Processing
Hu, Yaxi, Sanyal, Amartya, Schölkopf, Bernhard
When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.
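To make the accounting question concrete, here is a toy pipeline of the kind the framework covers, with placeholder numbers chosen for illustration: a data-dependent but non-private pre-processing step (mean imputation) followed by a standard DP release (Gaussian mechanism on a clipped mean). Naive accounting would only charge for the second step; the framework above bounds the privacy of the composition.

import numpy as np

def mean_impute(x):
    # Non-private, data-dependent pre-processing: fill missing values with the dataset mean.
    return np.where(np.isnan(x), np.nanmean(x), x)

def dp_mean(x, clip=1.0, sigma=2.0, rng=None):
    # Standard Gaussian-mechanism release of a clipped mean (the 'DP part' of the pipeline).
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(x, -clip, clip)
    return clipped.mean() + rng.normal(scale=sigma * 2 * clip / len(x))

x = np.array([0.2, np.nan, -0.4, 0.9, np.nan, 0.1])
release = dp_mean(mean_impute(x))   # the imputation itself depends on the data and also leaks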
On the Growth of Mistakes in Differentially Private Online Learning: A Lower Bound Perspective
Dmitriev, Daniil, Szabó, Kristóf, Sanyal, Amartya
With the increasing need to protect the privacy of sensitive user data while conducting meaningful data analysis, Differential Privacy (DP) [14] has become a popular solution. DP algorithms ensure that the impact of any single data sample on the output is limited, thus safeguarding individual privacy. Several works have obtained DP learning algorithms for various learning problems, in both theory and practice. However, privacy does not come for free and often carries a statistical (and sometimes computational) cost. The classical approach to non-private Probably Approximately Correct (PAC) learning [28] is Empirical Risk Minimisation (ERM), which computes the best solution on the training data. Several works [3, 11] have shown that incorporating DP into ERM incurs a compulsory statistical cost that depends on the dimension of the problem. In the well-known setting of PAC learning with DP, Kasiviswanathan et al. [23] provided the first guarantees, showing that every finite hypothesis class can be learned privately with a sample size that grows logarithmically in the size of the class. This line of research was advanced by subsequent works [4, 6, 17], culminating in the result of Alon et al. [2], which established a surprising equivalence between non-private online learning and approximate-DP PAC learning.
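For reference, the approximate DP definition this discussion relies on (the standard definition from the literature, restated here, not a contribution of the paper): a randomised algorithm $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all datasets $D, D'$ differing in a single sample and all measurable output sets $S$,
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta.$$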
Corrective Machine Unlearning
Goel, Shashwat, Prabhu, Ameya, Torr, Philip, Kumaraguru, Ponnurangam, Sanyal, Amartya
Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects like vulnerability to backdoored samples, systematic biases, and, in general, reduced accuracy on certain input domains. Often, not all manipulated training samples are known, and only a small, representative subset of the affected data is flagged. We formalize "Corrective Machine Unlearning" as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, possibly knowing only a subset of the impacted samples. We demonstrate that corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find that most existing unlearning methods, including the gold-standard retraining-from-scratch, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, SSD, achieves limited success in unlearning adverse effects with just a small portion of the manipulated samples, showing the tractability of this setting. We hope our work spurs research towards developing better methods for corrective unlearning and offers practitioners a new strategy to handle data integrity challenges arising from web-scale training.
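A hedged sketch of the evaluation protocol this setting implies (the function signature and the caller-supplied trainer, unlearner, and metrics are illustrative placeholders, not the paper's benchmark): train on a partially manipulated set, reveal only a fraction of the manipulated indices to the unlearning method, and check that the adverse effect disappears while clean accuracy is preserved.

import random

def corrective_unlearning_eval(train_fn, unlearn_fn, accuracy_fn, attack_rate_fn,
                               data, poisoned_idx, known_fraction,
                               clean_test, trigger_test, seed=0):
    """Protocol sketch: the developer is shown only `known_fraction` of the
    manipulated indices, yet must remove the manipulation's effect on the model.

    train_fn, unlearn_fn (e.g. retraining without the flagged points, or SSD),
    accuracy_fn and attack_rate_fn are all caller-supplied.
    """
    rng = random.Random(seed)
    model = train_fn(data)
    flagged = rng.sample(sorted(poisoned_idx), int(known_fraction * len(poisoned_idx)))
    model = unlearn_fn(model, data, flagged)
    return {
        "clean_accuracy": accuracy_fn(model, clean_test),       # should stay high
        "attack_success": attack_rate_fn(model, trigger_test),  # should drop towards chance
    }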