Goto

Collaborating Authors

 flexinfer


FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference

arXiv.org Artificial Intelligence

Although these approaches can improve models' Large Language Models (LLMs) face challenges for on-device memory efficiency, they inevitably impact the generality inference due to high memory demands. Traditional methods performance and still suffer in extreme resource-constrained to reduce memory usage often compromise performance scenarios [4, 9, 12]. Furthermore, these methods lack the flexibility and lack adaptability. We propose FlexInfer, an optimized to vary memory budgets or deployment constraints, offloading framework for on-device inference, addressing requiring adjusting the hyper-parameters, such as quantization these issues with techniques like asynchronous prefetching, or sparsity levels, offering limited choices, and imposing balanced memory locking, and flexible tensor preservation.


vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference highly bounded by memory. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using paged Attention mechanisms, they still suffer from inefficient memory and computational operations due to the tightly coupled page management and computation kernels. This study introduces the vTensor, an innovative tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses existing limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating various computation kernels across different LLM architectures. Experimental results indicate that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. Additionally, vTensor provides average speedups of 2.12x and 3.15x in kernel evaluation, reaching up to 3.92x and 3.27x compared to SGLang Triton prefix-prefilling kernels and vLLM paged Attention kernel, respectively. Furthermore, it frees approximately 71.25% (57GB) of memory on the NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.