SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices

May-26-2025, 18:02:29 GMT–Neural Information Processing Systems

As large language models gain widespread adoption, running them efficiently becomes a crucial task. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models and must offload them to RAM or SSD.

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)