Context-Aware Membership Inference Attacks against Pre-trained Large Language Models

Hongyan Chang, Ali Shahin Shamsabadi, Kleomenis Katevas, Hamed Haddadi, Reza Shokri

arXiv.org Machine Learning 

To assess memorization and information leakage in models, Membership Inference Attacks (MIAs) aim to determine whether a data point was part of a model's training set [1]. However, MIAs designed for pre-trained Large Language Models (LLMs) have been largely ineffective [2, 3], primarily because these attacks, originally developed for classification models, fail to account for the sequential nature of LLMs. Unlike classification models, which produce a single prediction per input, LLMs generate text token by token, conditioning each prediction on the context of the preceding tokens (i.e., the prefix). Prior MIAs overlook token-level loss dynamics and the influence of prefixes on next-token predictability, a key factor in memorization.
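
To make the notion of token-level loss dynamics concrete, the sketch below computes the per-token negative log-likelihood of a candidate sequence under a causal LM, where each token's loss is conditioned on its prefix. This is a minimal illustration of the kind of signal the abstract refers to, not the paper's attack; the model choice (`gpt2`) and the helper name `token_level_losses` are assumptions for demonstration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice (assumption); any Hugging Face causal LM works.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def token_level_losses(text: str) -> list[tuple[str, float]]:
    """Return (token, negative log-likelihood) pairs for each token in
    `text`, with each token's loss conditioned on its preceding prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, seq_len, vocab_size)
    # Shift by one: logits at position t predict the token at position t+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nlls = -log_probs[torch.arange(targets.size(0)), targets]
    tokens = tokenizer.convert_ids_to_tokens(targets.tolist())
    return list(zip(tokens, nlls.tolist()))


# Inspect how predictable each token is given its prefix; tokens that are
# unusually easy to predict in context can hint at memorization.
for tok, nll in token_level_losses("The quick brown fox jumps over the lazy dog."):
    print(f"{tok!r}: {nll:.3f}")
```

A classification-style MIA would collapse this sequence into a single aggregate loss; the per-token view preserved here is exactly the prefix-dependent structure that, per the abstract, prior attacks discard.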