AITopics | flexinfer

FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference

Hongchao Du, Shangyu Wu, Arina Kharlamova, Nan Guan, Chun Jason Xue

arXiv.org Artificial Intelligence, Mar-4-2025

Large Language Models (LLMs) face challenges for on-device inference due to high memory demands. Traditional methods to reduce memory usage often compromise performance and lack adaptability. We propose FlexInfer, an optimized offloading framework for on-device inference, addressing these issues with techniques like asynchronous prefetching, balanced memory locking, and flexible tensor preservation.

[...] Although these approaches can improve models' memory efficiency, they inevitably impact general performance and still suffer in extremely resource-constrained scenarios [4, 9, 12]. Furthermore, these methods lack the flexibility to adapt to varying memory budgets or deployment constraints: they require adjusting hyper-parameters such as quantization or sparsity levels, offer limited choices, and impose [...]
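As a rough illustration of the asynchronous prefetching idea named in the abstract, the following minimal Python sketch overlaps loading the next layer's weights with computing the current layer. It is not FlexInfer's implementation: `load_layer`, `apply_layer`, `run_with_prefetch`, and the toy `(scale, bias)` "weights" are hypothetical stand-ins chosen only to show the overlap pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(storage, index):
    """Hypothetical stand-in for reading one layer's weights from slow storage."""
    return storage[index]


def apply_layer(weights, activation):
    """Hypothetical stand-in for a layer's computation: a scalar multiply-add."""
    scale, bias = weights
    return activation * scale + bias


def run_with_prefetch(storage, activation):
    """Run layers in order while the next layer loads in a background thread.

    At most two layers' weights are held at once: the one being computed
    and the one being prefetched.
    """
    if not storage:
        return activation
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, storage, 0)
        for i in range(len(storage)):
            weights = pending.result()  # block until the current layer is loaded
            if i + 1 < len(storage):
                # Start loading the next layer before computing this one.
                pending = pool.submit(load_layer, storage, i + 1)
            activation = apply_layer(weights, activation)
    return activation


# Toy model: three layers, each a (scale, bias) pair (illustrative values).
layers = [(2.0, 1.0), (0.5, 0.0), (3.0, -1.0)]
result = run_with_prefetch(layers, 1.0)
print(result)
```

In a real offloading system the load step would be disk or flash I/O and the compute step a transformer layer; the benefit comes from the two running concurrently, which this toy example mirrors only in structure.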

arXiv:2503.03777

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > China > Hong Kong (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng

arXiv.org Artificial Intelligence, Jul-22-2024

Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference heavily memory-bound. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using PagedAttention, they still suffer from inefficient memory and computational operations because page management is tightly coupled with the computation kernels. This study introduces vTensor, a tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses existing limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating various computation kernels across different LLM architectures. Experimental results indicate that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. In kernel evaluation, vTensor provides average speedups of 2.12x and 3.15x, reaching up to 3.92x and 3.27x, compared to SGLang Triton prefix-prefilling kernels and the vLLM PagedAttention kernel, respectively. Furthermore, it frees approximately 71.25% (57GB) of memory on the NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.
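To make the fragmentation problem mentioned in the abstract concrete, the sketch below compares how many KV-cache token slots are reserved when each request preallocates a full maximum-length buffer versus when memory is handed out in fixed-size pages. This is simple bookkeeping only, not vTensor's or vLLM's implementation; the function names, sequence lengths, `max_len`, and `page_size` are illustrative assumptions.

```python
import math


def reserved_contiguous(seq_lens, max_len):
    """Token slots reserved when every request preallocates max_len slots."""
    return len(seq_lens) * max_len


def reserved_paged(seq_lens, page_size):
    """Token slots reserved when each request holds only whole pages it needs."""
    return sum(math.ceil(n / page_size) * page_size for n in seq_lens)


def waste_fraction(reserved, seq_lens):
    """Fraction of reserved slots that hold no token."""
    used = sum(seq_lens)
    return (reserved - used) / reserved


# Illustrative batch: current sequence lengths of four requests.
seq_lens = [100, 900, 30, 2048]
max_len = 2048    # assumed per-request maximum context length
page_size = 16    # assumed page granularity in tokens

contiguous = reserved_contiguous(seq_lens, max_len)
paged = reserved_paged(seq_lens, page_size)

print(f"contiguous: {contiguous} slots, waste {waste_fraction(contiguous, seq_lens):.1%}")
print(f"paged:      {paged} slots, waste {waste_fraction(paged, seq_lens):.1%}")
```

Page-granular allocation removes most of the waste, but in paged designs the attention kernel must follow a page table. The abstract's claim is that backing a virtually contiguous tensor with on-demand physical pages avoids that coupling; the sketch does not model this part.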

arXiv:2407.15309

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > Carlsbad (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.49)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
