Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration

May-27-2025, 21:13:46 GMT–Neural Information Processing Systems

Membership Inference Attacks (MIAs) aim to infer whether a target data record was used to train a model. Existing MIAs designed for large language models (LLMs) fall into two types: reference-free and reference-based attacks. Although reference-based attacks appear to achieve promising performance by calibrating the probability measured on the target model against reference models, this illusion of privacy risk depends heavily on a reference dataset that closely resembles the training set. Both types of attacks are predicated on the hypothesis that training records consistently have a higher probability of being sampled. However, this hypothesis relies heavily on the target model overfitting, which is mitigated by multiple regularization methods and by the generalization of LLMs. Consequently, MIAs suffer from high false-positive rates in practical scenarios. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA).
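The two existing attack families described in the abstract can be illustrated with a minimal sketch. It is not the SPV-MIA method itself, only the generic reference-free and reference-based scoring that the abstract contrasts. The function names, the use of average per-token log-likelihood as the membership signal, and the thresholding rule are assumptions made for illustration; the per-token log-probabilities are toy values, not results.

```python
from typing import Sequence


def avg_log_likelihood(token_log_probs: Sequence[float]) -> float:
    """Average per-token log-probability of a text under some model."""
    return sum(token_log_probs) / len(token_log_probs)


def reference_free_score(target_log_probs: Sequence[float]) -> float:
    """Reference-free signal (assumed form): likelihood under the target model only."""
    return avg_log_likelihood(target_log_probs)


def reference_based_score(
    target_log_probs: Sequence[float],
    reference_log_probs: Sequence[float],
) -> float:
    """Reference-based signal (assumed form): target likelihood calibrated
    by subtracting the likelihood under a reference model."""
    return avg_log_likelihood(target_log_probs) - avg_log_likelihood(reference_log_probs)


def is_member(score: float, threshold: float) -> bool:
    """Predict 'member' when the score exceeds a chosen threshold (hypothetical rule)."""
    return score > threshold


# Toy per-token log-probabilities for one candidate record (illustrative only).
target_lp = [-1.0, -1.0, -1.0, -1.0]
reference_lp = [-2.0, -2.0, -2.0, -2.0]

free = reference_free_score(target_lp)
calibrated = reference_based_score(target_lp, reference_lp)
```

The calibrated score is only as informative as the reference model, which is why the abstract stresses the dependence on a reference dataset that closely resembles the training set.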

large language model, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Performance Analysis
    - Accuracy (0.98)