Bayesian Inference of Training Dataset Membership

Huang, Yongchao

arXiv.org Artificial Intelligence 

Machine learning models, particularly deep neural networks, are vulnerable to privacy attacks such as membership inference attacks (MIAs), which determine whether a specific data point was included in a model's training set [9, 10, 2]. These attacks exploit the tendency of models to exhibit distinct behaviors (e.g. higher confidence or lower loss) on training data compared to unseen data, potentially compromising the confidentiality of sensitive datasets, such as those containing medical or financial records. State-of-the-art MIAs typically rely on extensive knowledge of the target model. For example, shadow model-based approaches [9] train multiple models to mimic the target's behavior, while others, e.g. the likelihood ratio attack (LiRA) by Carlini et al. [2], leverage model outputs or gradients. These methods often incur significant computational costs or require access to model internals, limiting their applicability in scenarios where only model outputs are available. We propose a new MIA method that leverages Bayesian inference for post-hoc analysis of trained models and datasets. Once an ML model, e.g. a neural network, has been trained on member datasets, we pass test data through the trained model, extract resulting metrics such as accuracy, entropy, perturbation magnitude, and dataset statistics, and use these metrics to compute posterior probabilities of membership. This approach does not require access to a 'training' set, although prior knowledge about member and non-member datasets can improve its performance. This post-hoc method is computationally efficient and interpretable, and requires minimal model querying and fine-tuning, making it well-suited for real-world deployment scenarios where privacy assessments are conducted after model training.
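The Bayesian step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes independent Gaussian likelihood models for the per-point metrics under the member and non-member hypotheses, and all function names and numbers are hypothetical.

```python
import numpy as np

def membership_posterior(metrics, member_stats, nonmember_stats, prior=0.5):
    """Posterior probability that a point is a training-set member.

    metrics: per-point metric vector (e.g. loss, predictive entropy).
    member_stats / nonmember_stats: (mean, std) arrays of the metric
    distributions under each hypothesis (illustrative Gaussian assumption).
    """
    def log_gauss(x, mu, sigma):
        # Independent Gaussian log-likelihood over the metric vector.
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                      - (x - mu)**2 / (2 * sigma**2))

    ll_m = log_gauss(metrics, *member_stats)     # log p(metrics | member)
    ll_n = log_gauss(metrics, *nonmember_stats)  # log p(metrics | non-member)
    # Bayes' rule in log space for numerical stability.
    log_num = np.log(prior) + ll_m
    log_den = np.logaddexp(log_num, np.log(1 - prior) + ll_n)
    return np.exp(log_num - log_den)

# Hypothetical statistics: members show low loss/entropy, non-members higher.
member_stats = (np.array([0.1, 0.2]), np.array([0.05, 0.1]))   # (mean, std)
nonmember_stats = (np.array([0.8, 1.5]), np.array([0.3, 0.5]))
p = membership_posterior(np.array([0.12, 0.25]), member_stats, nonmember_stats)
```

A test point whose metrics resemble the member profile receives a posterior close to 1; the prior can encode any known base rate of membership.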