Language Model Inversion
Morris, John X., Zhao, Wenting, Chiu, Justin T., Shmatikov, Vitaly, Rush, Alexander M.
–arXiv.org Artificial Intelligence
Language models produce a distribution over the next token; can we use this to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of 59 and token-level F1 of 78 and recovers 27% of prompts exactly.
Language models are autoregressive, outputting the probability of each next token in a sequence conditioned on the preceding text. This distribution is used to generate future tokens in the sequence. Can this distribution also be used to reconstruct the prompt? In most contexts, this question is pointless, since we have already conditioned on this information. Increasingly, however, language models are offered "as a service", where the user may have access to the outputs but not to all of the true prompt. In this context, it may be of interest to users to know the prompt and, perhaps, for the service provider to protect it. This goal has been the focus of "jailbreaking" approaches that attempt to use the forward text generation of the model to reveal the prompt.
We formalize this problem of prompt reconstruction as language model inversion: recovering the input prompt conditioned on the language model's next-token probabilities. Interestingly, work in computer vision has shown that probability predictions of image classifiers retain a surprising amount of detail (Dosovitskiy & Brox, 2016), so it is plausible that this also holds for language models.
We propose an architecture that predicts prompts by "unrolling" the distribution vector into a sequence that can be processed effectively by a pretrained encoder-decoder language model. This method shows for the first time that language model predictions are mostly invertible: in many cases, we are able to recover very similar inputs to the original, sometimes getting the input text back exactly.
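To make the "unrolling" idea concrete, here is a minimal sketch of one way a vocabulary-sized probability vector could be turned into a sequence of fixed-width vectors that an encoder could consume. This is an illustration, not the authors' implementation: the function name `unroll_distribution`, the chunk width, the use of log-probabilities, and the zero padding are all assumptions not stated in the text above.

```python
import numpy as np


def unroll_distribution(probs: np.ndarray, width: int) -> np.ndarray:
    """Reshape a length-V probability vector into a (ceil(V / width), width) sequence.

    Hypothetical helper for illustration only; the chunking scheme is an assumption.
    """
    if probs.ndim != 1:
        raise ValueError("expected a 1-D probability vector")
    if width <= 0:
        raise ValueError("width must be positive")

    # Assumption: work in log space so very small probabilities stay distinguishable.
    log_probs = np.log(np.clip(probs, 1e-12, None))

    # Pad with zeros (arbitrary filler) so the length is a multiple of `width`.
    pad = (-len(log_probs)) % width
    padded = np.pad(log_probs, (0, pad), constant_values=0.0)

    # Each row is one "pseudo-token" vector of the resulting sequence.
    return padded.reshape(-1, width)


# Toy usage: a uniform distribution over a 10-token vocabulary, chunks of width 4.
probs = np.full(10, 0.1)
sequence = unroll_distribution(probs, width=4)
print(sequence.shape)
```

In a real system the width would presumably match the encoder's hidden size, and a learned projection could be applied before or after the reshape; both choices are left open here.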
Nov-22-2023
- Country:
- South America > Colombia
- Bogotá D.C. > Bogotá (0.04)
- North America
- United States
- Texas (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- New York > New York County
- New York City (0.04)
- Canada > Ontario
- Toronto (0.04)
- Asia
- Uzbekistan (0.04)
- Middle East
- Iran (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Genre:
- Research Report (1.00)
- Personal > Honors (0.46)
- Industry:
- Information Technology > Security & Privacy (0.46)
- Leisure & Entertainment (0.46)
- Technology: