ImpMIA: Leveraging Implicit Bias for Membership Inference Attack under Realistic Scenarios
Yuval Golbari, Navve Wasserman, Gal Vardi, Michal Irani
–arXiv.org Artificial Intelligence
Determining which data samples were used to train a model--known as a Membership Inference Attack (MIA)--is a well-studied and important problem with implications for data privacy. Black-box methods presume access only to the model's outputs and often rely on training auxiliary reference models. While they have shown strong empirical performance, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. In this paper, we demonstrate that removing these assumptions leads to a significant drop in the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks, thereby removing the need to rely on any reference models and their assumptions. ImpMIA is a white-box attack--a setting that assumes access to model weights and is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). Building on maximum-margin implicit bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples: it finds the samples whose gradients most strongly reconstruct the trained model's parameters. As a result, ImpMIA achieves state-of-the-art performance compared to both black-box and white-box attacks in realistic settings where only the model weights and a superset of the training data are available.

Ensuring that trained models do not leak information about their training sets is a critical challenge. Membership inference attacks (MIAs) evaluate this risk by determining whether a given example was part of a model's training data.
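The KKT-based idea can be sketched on a toy linear model: a classifier trained to convergence on separable data converges in direction to the max-margin solution, whose stationarity condition writes the weights as a nonnegative combination of training-sample gradients, with positive coefficients only on (training) support samples. The sketch below is a hedged illustration, not the paper's implementation: the dimensions, the synthetic "trained" weight vector, and the projected-gradient solver are all assumptions. It recovers the KKT coefficients by nonnegative least squares over a candidate pool and uses them as membership scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative dimensions): for a linear classifier trained to
# convergence on separable data, max-margin implicit bias gives the KKT
# stationarity condition
#   theta = sum_i lambda_i * y_i * x_i,   lambda_i >= 0,
# with lambda_i > 0 only on training (support) samples.
d, n_members, n_nonmembers = 100, 20, 20
X_mem = rng.normal(size=(n_members, d))
X_non = rng.normal(size=(n_nonmembers, d))
y_mem = rng.choice([-1.0, 1.0], size=n_members)
y_non = rng.choice([-1.0, 1.0], size=n_nonmembers)

# Stand-in "trained model": a nonnegative combination of member gradients,
# mimicking the KKT structure above (synthetic, for illustration only).
lam_true = np.abs(rng.normal(size=n_members)) + 0.1
theta = (lam_true[:, None] * y_mem[:, None] * X_mem).sum(axis=0)

# Candidate pool = superset of the training data (members + non-members);
# for this linear model, each row is the per-sample margin gradient y_i x_i.
G = np.concatenate([y_mem[:, None] * X_mem, y_non[:, None] * X_non])

# Nonnegative least squares via projected gradient descent:
# minimize ||theta - G^T lam||^2  subject to  lam >= 0.
lam = np.zeros(len(G))
step = 1.0 / np.linalg.norm(G @ G.T, 2)  # 1 / largest eigenvalue of G G^T
for _ in range(2000):
    grad = G @ (G.T @ lam - theta)
    lam = np.maximum(lam - step * grad, 0.0)

# Membership score = recovered reconstruction coefficient.
scores = lam
print("mean member score:    ", scores[:n_members].mean())
print("mean non-member score:", scores[n_members:].mean())
```

On this toy instance the true training samples receive large coefficients while non-members' coefficients shrink toward zero, mirroring how ImpMIA ranks candidates by how strongly their gradients reconstruct the trained model's parameters.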
MIAs can be broadly divided into two categories: black-box, which assume only query access to model outputs (Shokri et al., 2017; Yeom et al., 2018; Li & Zhang, 2021; Carlini et al., 2022), and white-box, which exploit access to internal parameters such as weights or gradients (Nasr et al., 2019; Leino & Fredrikson, 2020; Cohen & Giryes, 2024). The most effective black-box MIAs are reference-model-based attacks. These methods estimate the distribution of losses for members (training samples) versus non-members by training auxiliary reference models that mimic the target model, thereby learning its loss behavior. However, training large sets of reference models is computationally expensive, and--more importantly--their effectiveness depends on the reference models being accurate surrogates of the target.
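For contrast, the reference-model approach described above can be sketched as a likelihood-ratio test in the style of Carlini et al. (2022): per example, fit the distribution of its loss under reference models trained with versus without it, then score the target model's loss on that example. Everything in the sketch below is an illustrative placeholder under stated assumptions (synthetic loss samples, Gaussian fits), not any specific published implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_logpdf(x, mu, sigma):
    """Log-density of a univariate Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Synthetic stand-ins for one candidate example's loss under reference
# models: "in" models (example in their training set) tend to give lower
# loss than "out" models. Real attacks obtain these by actually training
# many reference models, which is the expensive step.
losses_in = rng.normal(loc=0.2, scale=0.1, size=64)   # reference models trained WITH the example
losses_out = rng.normal(loc=1.0, scale=0.3, size=64)  # reference models trained WITHOUT it

mu_in, sd_in = losses_in.mean(), losses_in.std() + 1e-8
mu_out, sd_out = losses_out.mean(), losses_out.std() + 1e-8

def membership_score(target_loss):
    # Log-likelihood ratio: member hypothesis vs. non-member hypothesis.
    return gauss_logpdf(target_loss, mu_in, sd_in) - gauss_logpdf(target_loss, mu_out, sd_out)

print(membership_score(0.15))  # low target loss  -> positive (member-like) score
print(membership_score(1.1))   # high target loss -> negative (non-member-like) score
```

Note how every quantity here depends on the reference models resembling the target; when the attacker lacks the training hyperparameters or in-distribution non-member data, those fitted loss distributions drift from the target's, which is exactly the failure mode the paper highlights.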
Oct-21-2025