Collaborating Authors

 Hu, Shengyuan


Position: LLM Unlearning Benchmarks are Weak Measures of Progress

arXiv.org Artificial Intelligence

Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view of the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model's performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in the unlearning targets of existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.
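As an illustration of the kind of benign query modification described above, the sketch below probes whether a forbidden answer still surfaces under simple rephrasings of a benchmark's test queries. It is a minimal, hypothetical probe, not the paper's evaluation protocol; `generate_fn` stands in for whatever model interface is available.

```python
# Hypothetical probe: does "unlearned" information resurface when a benchmark
# query is rephrased benignly? `generate_fn` wraps any text-generation API.
from typing import Callable, Iterable

def rephrasings(question: str) -> Iterable[str]:
    # Benign, meaning-preserving variants of the original test query.
    yield question
    yield f"In your own words, {question[0].lower()}{question[1:]}"
    yield f'A student asks: "{question}" How would you answer?'

def leak_rate(generate_fn: Callable[[str], str],
              benchmark: list[tuple[str, str]]) -> float:
    """Fraction of forget-set items whose forbidden answer appears in the
    response to at least one benignly rephrased query."""
    leaked = 0
    for question, forbidden_answer in benchmark:
        if any(forbidden_answer.lower() in generate_fn(q).lower()
               for q in rephrasings(question)):
            leaked += 1
    return leaked / max(len(benchmark), 1)
```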


Jogging the Memory of Unlearned Model Through Targeted Relearning Attack

arXiv.org Artificial Intelligence

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can 'jog' the memory of unlearned models to reverse the effects of unlearning. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.
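The relearning pipeline itself is simple: briefly finetune the unlearned model on a small, loosely related corpus, then re-issue the original forget-set queries. Below is a minimal sketch under those assumptions; the checkpoint path and corpus are placeholders, not artifacts from the paper.

```python
# Minimal relearning-attack sketch: a few finetuning steps on loosely related
# text, followed by re-querying the original forget-set prompts.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/unlearned-model"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

relearn_corpus = ["...a few public, loosely related passages..."]  # placeholder
optimizer = AdamW(model.parameters(), lr=2e-5)

for epoch in range(3):                          # a handful of gradient steps
    for text in relearn_corpus:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After relearning, re-run the forget-set queries and check whether the
# supposedly unlearned information is produced again.
```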


Guardrail Baselines for Unlearning in LLMs

arXiv.org Artificial Intelligence

Recent years have seen two trends emerge simultaneously: large language models (LLMs) trained on increasing amounts of user data (generally scraped indiscriminately from the web), in parallel with increasing legal protections on digital data use, including data revocation ("right to be forgotten") laws. In order to support data revocation for models that have already been trained on potentially sensitive data, a number of works have proposed approaches for data "unlearning" (Bourtoule et al., 2021; Gupta et al., 2021; Ginart et al., 2019), which aims to remove the influence of specific subsets of training data without entirely retraining a model. Unlearning in LLMs is particularly challenging because individuals' information may not be confined to specific data points (Brown et al., 2022; Tramèr et al., 2022). Nevertheless, recent work has shown that model finetuning is a promising approach to forget, for example, information corresponding to the book series Harry Potter (Eldan and Russinovich, 2023); information about specific individuals in a synthetic dataset (Maini et al., 2024); or knowledge that could be exploited by malicious agents (Li et al., 2024). While finetuning is a promising approach, a number of recent works have shown that simple modifications to the input prompt or output postprocessing filters (which we collectively call "guardrails") can also be effective for generating a desirable output distribution from a model (Pawelczyk et al., 2023; Brown et al., 2020; Chowdhery et al., 2023; Wei et al., 2021; Kim et al., 2024). Prompt prefixes and postprocessing filters do not update the model weights, so the resulting model itself would not satisfy definitions of unlearning that require the distribution of model weights to match that of a model retrained from scratch (Bourtoule et al., 2021). However, in practical settings where users can only access the model through an API, modifying the output distribution alone can suffice. In fact, most existing unlearning benchmarks (Eldan and Russinovich, 2023; Maini et al., 2024; unl, 2023; Li et al., 2024) only examine the model outputs when evaluating unlearning, which is consistent with a threat model in which users have only API access (see Section 3). In this paper, we investigate how existing benchmarks fare under guardrail-based approaches, and show that in three popular unlearning benchmarks, guardrails not only give strong performance comparable to finetuning baselines, but can also surface weaknesses or inconsistencies in the benchmarks or metrics themselves.
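A minimal sketch of the two guardrail baselines discussed above, a prompt prefix and an output postprocessing filter, applied to a model reachable only through an API. The `generate_fn` wrapper, topic, and blocked-term list are hypothetical placeholders, not the paper's implementation.

```python
# Guardrail baselines: steer and filter the output distribution without
# updating any model weights.
from typing import Callable

REFUSAL = "I'm sorry, I can't help with that."

def guardrailed_generate(generate_fn: Callable[[str], str],
                         prompt: str,
                         forget_topic: str,
                         blocked_terms: list[str]) -> str:
    # 1) Prompt-prefix guardrail: instruct the model not to reveal the topic.
    prefix = (f"You must not reveal any information about {forget_topic}. "
              f"If asked, refuse.\n\n")
    response = generate_fn(prefix + prompt)

    # 2) Postprocessing guardrail: filter responses that still leak.
    if any(term.lower() in response.lower() for term in blocked_terms):
        return REFUSAL
    return response
```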


No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

arXiv.org Artificial Intelligence

Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack -- leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.
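For context, the sketch below shows a generic hash-seeded "green list" detector of the kind used by many LLM watermarking schemes; it is an illustrative stand-in, not one of the specific systems or attacks analyzed in the paper.

```python
# Generic green-list watermark detection: each token's "green list" is seeded
# by the previous token; watermarked text lands in the green list more often
# than the base rate gamma, yielding a large z-score.
import math
import random

def green_fraction(token_ids: list[int], vocab_size: int,
                   gamma: float = 0.5, key: int = 42) -> float:
    hits = 0
    for prev, cur in zip(token_ids, token_ids[1:]):
        rng = random.Random(key * 1_000_003 + prev)   # seed on previous token
        green = set(rng.sample(range(vocab_size), int(gamma * vocab_size)))
        hits += cur in green
    return hits / max(len(token_ids) - 1, 1)

def z_score(token_ids: list[int], vocab_size: int, gamma: float = 0.5) -> float:
    n = max(len(token_ids) - 1, 1)
    p = green_fraction(token_ids, vocab_size, gamma)
    return (p - gamma) * math.sqrt(n) / math.sqrt(gamma * (1 - gamma))
```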


Privacy Amplification for the Gaussian Mechanism via Bounded Support

arXiv.org Artificial Intelligence

Data-dependent privacy accounting frameworks such as per-instance differential privacy (pDP) and Fisher information loss (FIL) confer fine-grained privacy guarantees for individuals in a fixed training dataset. These guarantees can be desirable compared to vanilla DP in real-world settings, as they tightly upper-bound the privacy leakage for a $\textit{specific}$ individual in an $\textit{actual}$ dataset rather than considering worst-case datasets. While these frameworks are beginning to gain popularity, to date there is a lack of private mechanisms that can fully leverage the advantages of data-dependent accounting. To bridge this gap, we propose simple modifications of the Gaussian mechanism with bounded support, showing that they amplify privacy guarantees under data-dependent accounting. Experiments on model training with DP-SGD show that bounded-support Gaussian mechanisms can reduce the pDP bound $\epsilon$ by as much as 30% without negatively affecting model utility.
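One natural way to instantiate bounded support is to restrict the Gaussian noise to a ball around the true value, for example via rejection sampling. The sketch below is illustrative only; it does not reproduce the paper's exact mechanisms or their pDP accounting.

```python
# Illustrative bounded-support Gaussian mechanism: Gaussian noise rejected
# until it lies within an L2 ball of the given radius.
import numpy as np

def bounded_gaussian_mechanism(value: np.ndarray, sigma: float,
                               radius: float, rng=None) -> np.ndarray:
    rng = np.random.default_rng() if rng is None else rng
    while True:
        noise = rng.normal(scale=sigma, size=value.shape)
        if np.linalg.norm(noise) <= radius:     # truncate the noise support
            return value + noise

# Example: privatize a clipped per-example gradient, as in DP-SGD.
grad = np.clip(np.array([0.3, -1.2, 0.8]), -1.0, 1.0)
noisy_grad = bounded_gaussian_mechanism(grad, sigma=1.0, radius=3.0)
```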


Private Multi-Task Learning: Formulation and Applications to Federated Learning

arXiv.org Artificial Intelligence

Many problems in machine learning rely on multi-task learning (MTL), in which the goal is to solve multiple related machine learning tasks simultaneously. MTL is particularly relevant for privacy-sensitive applications in areas such as healthcare, finance, and IoT computing, where sensitive data from multiple, varied sources are shared for the purpose of learning. In this work, we formalize notions of client-level privacy for MTL via joint differential privacy (JDP), a relaxation of differential privacy for mechanism design and distributed optimization. We then propose an algorithm for mean-regularized MTL, an objective commonly used for applications in personalized federated learning, subject to JDP. We analyze our objective and solver, providing certifiable guarantees on both privacy and utility. Empirically, we find that our method provides improved privacy/utility trade-offs relative to global baselines across common federated learning benchmarks.
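A toy sketch of mean-regularized multi-task learning with a privatized mean, i.e., each client k minimizes f_k(w_k) + (lambda/2)||w_k - w_bar||^2 while only a noisy average of (clipped) client models is shared. The quadratic losses and noise calibration below are illustrative, not a certified JDP accountant.

```python
# Toy mean-regularized MTL with a noisy shared mean ("billboard"-style sketch).
import numpy as np

rng = np.random.default_rng(0)
K, d, lam, lr, clip, sigma = 10, 5, 0.5, 0.1, 1.0, 0.8
targets = rng.normal(size=(K, d))            # toy per-client optima
W = np.zeros((K, d))                         # per-client (personalized) models

for _ in range(50):
    # Server releases a privatized mean of norm-clipped client models.
    norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    clipped = W * np.minimum(1.0, clip / norms)
    w_bar = clipped.mean(axis=0) + rng.normal(scale=sigma * clip / K, size=d)
    # Each client steps on f_k(w) + (lam/2)||w - w_bar||^2,
    # with toy quadratic losses f_k(w) = 0.5 * ||w - target_k||^2.
    grads = (W - targets) + lam * (W - w_bar)
    W -= lr * grads
```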


Federated Learning as a Network Effects Game

arXiv.org Artificial Intelligence

Federated Learning (FL) aims to foster collaboration among a population of clients to improve the accuracy of machine learning without directly sharing local data. Although there is a rich literature on designing federated learning algorithms, most prior works implicitly assume that all clients are willing to participate in an FL scheme. In practice, clients may not benefit from joining FL, especially in light of potential costs related to issues such as privacy and computation. In this work, we study clients' incentives in federated learning to help the service provider design better solutions and ensure clients make better decisions. We are the first to model clients' behavior in FL as a network effects game, where each client's benefit depends on the other clients who also join the network. Using this setup, we analyze the dynamics of clients' participation and characterize the equilibrium, in which no client has an incentive to alter its decision. Specifically, we show that dynamics in the population naturally converge to an equilibrium without needing explicit interventions. Finally, we provide a cost-efficient payment scheme that incentivizes clients to reach a desired equilibrium when the initial network is empty.
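A toy best-response simulation of a network effects game with this flavor, where each client joins if its benefit from the current number of participants exceeds its cost; the utility and cost models below are illustrative, not the paper's formulation.

```python
# Toy network effects game: sequential best-response dynamics until no client
# wants to change its participation decision (an equilibrium).
import numpy as np

rng = np.random.default_rng(1)
n = 100
costs = rng.uniform(0.05, 0.5, size=n)        # heterogeneous participation costs

def benefit(k):
    return k / n                              # benefit grows with participation

# Starting from an empty network, no client ever joins (benefit(0) = 0 is below
# every cost), which mirrors the motivation for a payment scheme; here we seed
# a few initial adopters instead.
participating = rng.random(n) < 0.3

changed = True
while changed:
    changed = False
    for i in range(n):
        others = participating.sum() - participating[i]
        wants_in = benefit(others) > costs[i]
        if wants_in != participating[i]:
            participating[i] = wants_in
            changed = True

print(f"Equilibrium participation: {int(participating.sum())} / {n}")
```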


Federated Multi-Task Learning for Competing Constraints

arXiv.org Machine Learning

In addition to accuracy, fairness and robustness are two critical concerns for federated learning systems. In this work, we first identify that robustness to adversarial training-time attacks and fairness, measured as the uniformity of performance across devices, are competing constraints in statistically heterogeneous networks. To address these constraints, we propose employing a simple, general multi-task learning objective, and analyze the ability of the objective to achieve a favorable tradeoff between fairness and robustness. We develop a scalable solver for the objective and show that multi-task learning can enable more accurate, robust, and fair models relative to state-of-the-art baselines across a suite of federated datasets.
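One common instantiation of such a simple multi-task objective (a sketch; the paper's exact formulation may differ) regularizes each device's personalized model toward a shared global model, with the strength of the tie controlled by lambda:

```latex
% Hedged sketch of a mean-regularized multi-task objective: device k keeps a
% personalized model v_k tied to the global model w^*.
\min_{v_k} \; h_k(v_k; w^{*}) \;=\; f_k(v_k) + \frac{\lambda}{2}\,\lVert v_k - w^{*} \rVert^{2},
\qquad
w^{*} \in \arg\min_{w} \sum_{k=1}^{K} p_k\, f_k(w)
```

Setting lambda = 0 recovers purely local models, while a large lambda recovers the single global model, so intermediate values can navigate tradeoffs between per-device performance uniformity and robustness to corrupted updates.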