Differentially Private Learning Needs Better Model Initialization and Self-Distillation

Ivoline C. Ngong, Joseph P. Near, Niloofar Mireshghallah

arXiv.org Artificial Intelligence 

Differentially private SGD (DPSGD) enables privacy-preserving training of language models, but often reduces utility, diversity, and linguistic quality. We introduce DPRefine, a three-phase method that initializes a model using data synthesis from a small pre-trained LM with rigorous filtering, applies DP fine-tuning on private data, and performs self-distillation to refine outputs. This approach significantly outperforms vanilla DPSGD, with AlpacaEval preferring DPRefine's generations in 78.4% of cases across all datasets. Our analysis reveals that DPRefine reduces linguistic errors in generated text by 84.0%, mitigating grammar errors.

Using DPSGD to fine-tune pre-trained models on private data often yields poor results, particularly when the private dataset is small (Tramèr et al., 2022; Mireshghallah et al., 2021). Recent work has shown that leveraging better hand-crafted features (Tramer and Boneh, 2020) or features from large pre-trained language models (Li et al., 2022, 2021) can improve the privacy-utility tradeoff in differentially private learning. However, these approaches have limitations: smaller pre-trained models offer limited benefits, and fine-tuning larger models on private data may be infeasible due to proprietary concerns or infrastructure limitations. This raises a critical question: Can we develop small, domain-specific language models that achieve high performance without requiring large private datasets or large pre-trained models?
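To make the three-phase pipeline concrete, the sketch below outlines one possible structure in Python, assuming Hugging Face transformers and Opacus for DPSGD. The model choice (gpt2), the passes_filters heuristic, and all hyperparameters are illustrative placeholders, not the paper's actual configuration; Opacus also requires that every layer type in the model have a registered per-sample gradient computation, which may require substituting unsupported layers.

```python
# Minimal sketch of the three DPRefine phases, under the assumptions above.
import torch
from opacus import PrivacyEngine
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small pre-trained LM
model = AutoModelForCausalLM.from_pretrained("gpt2")


def passes_filters(text: str) -> bool:
    # Placeholder quality filter; the paper applies rigorous filtering here.
    return 10 < len(text.split()) < 200


# --- Phase 1: initialization via synthetic data with filtering ----------
def synthesize_and_filter(prompts, max_new_tokens=64):
    """Generate candidates with the small LM; keep only those that pass
    the quality filters. The kept text seeds initial fine-tuning."""
    kept = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids
        out = model.generate(
            ids, max_new_tokens=max_new_tokens, do_sample=True,
            top_p=0.9, pad_token_id=tokenizer.eos_token_id)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        if passes_filters(text):
            kept.append(text)
    return kept


# --- Phase 2: DP fine-tuning on private data ----------------------------
def dp_finetune(model, private_loader, epochs=1):
    """Fine-tune with DPSGD; Opacus clips per-sample gradients and adds
    noise inside optimizer.step(). Noise/clipping values are illustrative."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model, optimizer, loader = PrivacyEngine().make_private(
        module=model, optimizer=optimizer, data_loader=private_loader,
        noise_multiplier=1.0, max_grad_norm=1.0)
    model.train()
    for _ in range(epochs):
        for batch in loader:  # batch: dict with input_ids, attention_mask
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


# --- Phase 3: self-distillation -----------------------------------------
# Generate from the DP-fine-tuned model, filter the outputs with the same
# quality checks, and fine-tune on them (non-privately). Training on the
# DP model's own outputs is post-processing, so it adds no privacy cost.
```

One design point worth noting: because phase 1 uses only public synthetic data and phase 3 uses only the DP model's own generations, the private dataset is touched exclusively in phase 2, so the overall privacy guarantee is determined by the DPSGD step alone.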