Differentially Private Steering for Large Language Model Alignment
Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal
–arXiv.org Artificial Intelligence
Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful statements) and minimizing information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those samples. In this work, we present the first study of aligning LLM behavior with private datasets. We propose the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral, and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy of LLM steering via activation editing. Our attack is tailored to activation editing and relies solely on the generated texts, without their associated probabilities. Our experiments support the theoretical analysis, showing stronger empirical privacy for PSA compared to several existing non-private techniques.

LLMs often generate inaccurate, biased, or even harmful information that violates human values and preferences (Rawte et al., 2023). In response, recent research has increasingly focused on aligning LLMs towards certain desired behaviors (Konen et al., 2024) while preventing potentially harmful and unsafe outcomes. This has led to the development of several techniques for aligning LLMs, such as Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), instruction tuning (Wei et al., 2022), In-Context Learning (ICL) (Dong et al., 2022), and prompt engineering (Cheng et al., 2024). Nevertheless, several challenges remain, including the lack of diverse and representative datasets for alignment (Liu et al., 2024c), difficulties in addressing out-of-distribution issues (Liu et al., 2024a), the choice of alignment strategy (Ivison et al., 2024), and the lack of interpretability in traditional alignment methods (Lee et al., 2024). The linear representation hypothesis (Park et al., 2024b) suggests that high-level concepts are linearly represented as directions in the representation space of LLMs.
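To make the underlying mechanism concrete, the following is a minimal sketch of differentially private activation steering, assuming a contrastive (mean-difference) steering vector and the standard clip-then-add-Gaussian-noise recipe; the function name, parameters, and noise calibration are illustrative assumptions, not the paper's exact PSA procedure.

```python
# Illustrative sketch (assumed names and calibration, not the paper's exact PSA):
# build a steering vector as the mean difference between activations of paired
# positive and negative demonstrations, then privatize it by clipping each
# per-pair contribution and adding Gaussian noise (the Gaussian mechanism).
import numpy as np

def private_steering_vector(pos_acts, neg_acts, clip_norm=1.0,
                            noise_multiplier=1.0, rng=None):
    """pos_acts, neg_acts: (n, d) arrays of hidden activations at a chosen
    layer for n paired positive/negative demonstrations. Returns a (d,) vector."""
    rng = np.random.default_rng() if rng is None else rng
    diffs = np.asarray(pos_acts) - np.asarray(neg_acts)   # per-pair contrasts
    # Clip each per-pair difference so a single demonstration pair has
    # bounded influence on the aggregate.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    diffs = diffs * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    mean_diff = diffs.mean(axis=0)
    # Gaussian noise scaled to the clipped per-pair sensitivity of the mean;
    # an appropriate noise_multiplier yields an (epsilon, delta)-DP guarantee.
    sigma = noise_multiplier * clip_norm / len(diffs)
    return mean_diff + rng.normal(0.0, sigma, size=mean_diff.shape)

# At inference time, steering typically adds the (now private) vector to the
# residual-stream activation of the chosen layer, scaled by a coefficient:
#     h_layer <- h_layer + alpha * steering_vector
```

Because the demonstrations enter generation only through this noisy aggregate, any text produced with the steered model inherits the DP guarantee by post-processing.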
Jan-30-2025