SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices
Neural Information Processing Systems
As large language models gain widespread adoption, running them efficiently becomes a crucial task. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models and must offload them to RAM or SSD.
May-26-2025, 18:02:29 GMT