attention computation
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
Large Language Models (LLMs) are widely used in today's natural language processing tasks. To support applications like multi-turn chats, document understanding, and content generation, models with long context lengths are growing in importance. However, managing long contexts brings substantial challenges due to the expansion of the key-value cache (KV cache). A longer KV cache requires more memory, which limits the batch size and thus decreases throughput. Also, computing attention over a long KV cache incurs more memory accesses, hurting end-to-end latency. Prior works find that it is sufficient to use only the recent and high-impact tokens for attention computation, allowing the eviction of less vital tokens to shrink the cache size. Nonetheless, we observe a dynamic shift in token importance across different decoding steps. Tokens initially evicted might regain importance after certain decoding steps. To address this, we propose ArkVale, a page-based KV cache manager that can recognize and recall currently important tokens that were evicted earlier. We asynchronously copy each filled page into external memory (e.g., CPU memory) as a backup and summarize it into a much smaller digest by constructing the bounding volume of its keys. Before attention computation, we measure all pages' importance based on their digests, recall the important ones, evict the unimportant ones, and select the top-ranked pages for attention computation. Experimental results show that ArkVale performs well on various long-context tasks with negligible accuracy loss under a 2k$\sim$4k cache budget and can improve decoding latency by $2.2\times$ and batching throughput by $4.6\times$, because it applies attention to only a small subset of pages and reduces the per-sample memory usage of the KV cache.
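The digest-and-rank step described in this abstract can be illustrated with a small sketch. It is not ArkVale's implementation: it assumes the bounding volume is an axis-aligned box (per-dimension min and max of a page's keys) and that page importance is the largest query-key dot product the box permits. The function names, page size, and budget are hypothetical.

```python
import numpy as np

def page_digests(keys, page_size):
    """Summarize each page of keys by an axis-aligned bounding box (assumed digest form)."""
    n, d = keys.shape
    assert n % page_size == 0, "sketch assumes only completely filled pages"
    pages = keys.reshape(n // page_size, page_size, d)
    return pages.min(axis=1), pages.max(axis=1)

def page_scores(query, lo, hi):
    """Upper bound on q.k over all keys that could lie inside each page's box."""
    # Per dimension, q_i * k_i over k_i in [lo_i, hi_i] is maximized at an endpoint.
    return np.maximum(query * lo, query * hi).sum(axis=1)

def select_pages(query, lo, hi, budget):
    """Return the indices of the `budget` highest-scoring pages, in page order."""
    scores = page_scores(query, lo, hi)
    top = np.argsort(-scores)[:budget]
    return np.sort(top)

rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))      # 64 cached keys, head dimension 8 (made-up sizes)
query = rng.normal(size=8)
lo, hi = page_digests(keys, page_size=16)
selected = select_pages(query, lo, hi, budget=2)
print("pages kept for attention:", selected)
```

Because the score is an upper bound on every key's dot product within the page, a page holding a high-scoring key cannot be ranked below that key's true score, though loose boxes may overestimate unimportant pages.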
Fast Attention Requires Bounded Entries
In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V):= \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by \emph{implicitly} making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$.
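For concreteness, the "straightforward method" mentioned in this abstract can be written directly from the definition $\mathrm{Att}(Q,K,V) = \mathrm{diag}(A {\bf 1}_n)^{-1} A V$ with $A = \exp(QK^\top/d)$. The sketch below materializes the full $n \times n$ matrix, which is exactly the quadratic cost the paper asks whether one can avoid; the sizes and the value of $B$ are made up for illustration.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Compute diag(A 1_n)^{-1} A V with A = exp(Q K^T / d), forming A explicitly."""
    n, d = Q.shape
    A = np.exp(Q @ K.T / d)                    # n x n attention matrix, entry-wise exp
    row_sums = A.sum(axis=1, keepdims=True)    # the vector A 1_n, kept as a column
    return (A / row_sums) @ V                  # row-normalize, then mix the rows of V

rng = np.random.default_rng(0)
n, d, B = 6, 4, 1.0                            # illustrative sizes and entry bound
Q, K, V = (rng.uniform(-B, B, size=(n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)
```

Each output row is a convex combination of the rows of $V$, so its entries stay within the column-wise range of $V$.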
Grounding Spatio-Temporal Language with Transformers
Language is an interface to the outside world. In order for embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extended literature studying how machines can learn grounded language, the topic of how to learn spatio-temporal linguistic concepts is still largely uncharted. To make progress in this direction, we here introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts if a description matches a given history of observations.
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Chen, Siqi, Hong, Ke, Zhao, Tianchen, Xie, Ruiqi, Zhu, Zhenhua, Zhang, Xudong, Wang, Yu
Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at https://github.com/thu-nics/db-SP.
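The abstract does not give the formula for the sparse imbalance ratio, so the sketch below assumes one plausible reading: the heaviest device's workload divided by the mean workload, where 1.0 means perfect balance. The greedy heaviest-first assignment is likewise a generic stand-in, not db-SP's dual-level partitioner, and the per-head workloads are made-up numbers.

```python
def imbalance_ratio(loads):
    """Assumed definition: max per-device load over mean per-device load."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

def contiguous_partition(work, num_devices):
    """Assign equal-sized contiguous groups of heads to devices, ignoring sparsity."""
    per = len(work) // num_devices
    return [sum(work[i * per:(i + 1) * per]) for i in range(num_devices)]

def greedy_partition(work, num_devices):
    """Place heads heaviest-first onto the currently least-loaded device."""
    loads = [0] * num_devices
    for w in sorted(work, reverse=True):
        loads[loads.index(min(loads))] += w
    return loads

# Hypothetical count of dense blocks per attention head.
head_work = [90, 80, 10, 10, 70, 20, 15, 25]
naive = contiguous_partition(head_work, 4)
balanced = greedy_partition(head_work, 4)
print(imbalance_ratio(naive), imbalance_ratio(balanced))
```

Unlike this sketch, a real head-parallel scheme may also need each device to hold the same number of heads, which is one reason a second, block-level balancing stage can be useful.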
LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
Shmilovich, Dor, Wu, Tony, Dahan, Aviad, Domb, Yuval
Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+\delta$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
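The idea of marking tiles once and carrying the skip decision forward can be sketched as a monotonically growing boolean mask. This is only an illustration of the propagation logic, not the LiteAttention kernel: the per-tile importance scores, the fixed threshold, and the function name are assumptions, and scores are precomputed here for every step even though a real kernel would never evaluate a skipped tile.

```python
import numpy as np

def propagate_skips(tile_scores_per_step, threshold):
    """Carry tile skip decisions forward across denoising steps.

    Returns the final skip mask and the number of tiles computed at each step.
    """
    skip = np.zeros(tile_scores_per_step[0].shape, dtype=bool)
    computed = []
    for scores in tile_scores_per_step:
        active = ~skip                         # tiles still being computed at this step
        computed.append(int(active.sum()))
        # Only tiles computed at this step can be newly marked; old marks persist.
        skip |= active & (scores < threshold)
    return skip, computed

# Hypothetical importance scores for a 2 x 2 grid of tiles over three steps.
steps = [
    np.array([[0.90, 0.01], [0.02, 0.80]]),
    np.array([[0.70, 0.50], [0.01, 0.90]]),
    np.array([[0.001, 0.90], [0.90, 0.90]]),
]
skip, computed = propagate_skips(steps, threshold=0.05)
print(computed)
print(skip)
```

Note that the top-right tile would have scored 0.50 at the second step but stays skipped: the sketch relies entirely on the temporal-coherence assumption stated in the abstract.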
Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
Kim, Dowon, Lee, MinJae, Kim, Janghyeon, Kwon, HyuckSung, Jeong, Hyeonggyu, Park, Sang-Soo, Yoon, Minyong, Roh, Si-Dong, Kwon, Yongsuk, So, Jinin, Choi, Jungwook
The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory as context lengths increase. This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. Our design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. We further introduce a hybrid parallelization strategy and a steady-token selection mechanism to enhance compute efficiency and scalability. Implemented atop a state-of-the-art CXL-PNM system, our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts. Our PNM-only offloading scheme (PNM-KV) and GPU-PNM hybrid with steady-token execution (PnG-KV) achieve up to 21.9x throughput improvement, up to 60x lower energy per token, and up to 7.3x better total cost efficiency than the baseline, demonstrating that CXL-enabled multi-PNM architectures can serve as a scalable backbone for future long-context LLM inference.
$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
Fan, Yingqi, Zhao, Anhao, Fu, Jinlan, Tong, Junlong, Su, Hui, Pan, Yijie, Zhang, Wei, Shen, Xiaoyu
Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.