Better Language Model Inversion by Compactly Representing Next-Token Distributions
Neural Information Processing Systems
Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method, prompt inversion from logprob sequences (PILS), that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion.
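The low-dimensional-subspace insight can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy model whose logits are a linear function of a final hidden state, and the names `W`, `H`, `basis`, and `compress`, as well as all sizes and the random weights, are hypothetical. Under that assumption each log-probability vector equals `W @ h` minus a scalar normalizer, so it lies in the span of the columns of `W` plus the all-ones vector, and a fixed linear map can compress it to `hidden_size + 1` numbers without loss.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_steps = 1000, 16, 5  # toy sizes (assumptions)

# Hypothetical unembedding matrix and per-step final hidden states.
W = rng.normal(size=(vocab_size, hidden_size))
H = rng.normal(size=(num_steps, hidden_size))

# Next-token log-probabilities at each generation step (stable log-softmax).
logits = H @ W.T                                   # (num_steps, vocab_size)
row_max = logits.max(axis=1, keepdims=True)
log_norm = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
logprobs = logits - log_norm

# Every logprob vector is W @ h - c * ones, so it lies in the column space
# of [W | 1], which has dimension at most hidden_size + 1.
basis = np.hstack([W, np.ones((vocab_size, 1))])   # (vocab_size, hidden_size + 1)
compress = np.linalg.pinv(basis)                   # linear map: vocab -> hidden + 1

codes = logprobs @ compress.T                      # (num_steps, hidden_size + 1)
reconstructed = codes @ basis.T                    # back to (num_steps, vocab_size)
max_error = np.abs(reconstructed - logprobs).max()

print(codes.shape, max_error)
```

In this toy setting the reconstruction error is at the level of floating-point round-off, which is the sense in which the compression is lossless. How a real deployment obtains a suitable basis when the model's weights are not accessible is outside the scope of this sketch.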
Jun-14-2026, 16:10:46 GMT
- Country:
- North America > United States (1.00)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)