Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Sarah Pratt
Neural Information Processing Systems
Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to produce each draft. Consequently, providing k drafts to the user requires running an expensive language model k times. To alleviate this cost, we propose Superposed Decoding, a new decoding algorithm that generates k drafts at the computation cost of a single autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the k drafts as input to the next decoding step of the language model.
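The abstract only sketches the mechanism, so the following is a minimal illustrative sketch, not the paper's implementation. It assumes a Hugging Face GPT-2, uniform mixing weights over the k drafts, and a deliberately simplified rule that extends draft i with the i-th most likely next token; the paper's actual draft initialization, weighting, and selection procedure are not reproduced here.

```python
# Minimal sketch of the superposition step (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
embed = model.get_input_embeddings()

k, steps = 3, 8
prompt_ids = tokenizer("The weather tomorrow will be", return_tensors="pt").input_ids

with torch.no_grad():
    # One pass over the shared prompt; branch into k drafts by taking
    # the k most likely first tokens.
    logits = model(prompt_ids).logits[0, -1]
    drafts = [[t.item()] for t in logits.topk(k).indices]

    # The running input is the prompt embeddings plus, at each step, one
    # superposed embedding that stands in for all k drafts at once.
    inputs = embed(prompt_ids)  # (1, T, d)
    for _ in range(steps):
        last = torch.tensor([d[-1] for d in drafts])                  # (k,)
        mixed = embed(last).mean(dim=0, keepdim=True).unsqueeze(0)    # (1, 1, d)
        inputs = torch.cat([inputs, mixed], dim=1)
        logits = model(inputs_embeds=inputs).logits[0, -1]
        # Simplification: extend draft i with the i-th most likely token
        # (a placeholder for the paper's actual draft-selection rule).
        for i, t in enumerate(logits.topk(k).indices):
            drafts[i].append(t.item())

for d in drafts:
    print(tokenizer.decode(prompt_ids[0].tolist() + d))
```

Note the key property the sketch preserves: every generation step runs the model exactly once on a single input sequence, regardless of k, because the k drafts' latest tokens are collapsed into one mixed embedding before the forward pass.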