D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models
Neural Information Processing Systems
Large language models have had an impressive societal impact owing to their strong understanding and logical reasoning skills. However, this ability relies on a huge amount of computing resources, which makes it difficult to deploy LLMs on resource-constrained platforms. Currently, LLMs spend the same amount of computation on every token, but we argue that not every word is equally important. Some words do not warrant extensive computing resources, particularly dispensable terms in simple questions. In this paper, we propose a novel dynamic inference paradigm for LLMs, namely D-LLMs, which adaptively allocate computing resources in token processing.
May-26-2025, 14:58:22 GMT