recomputation
On the Ineffectiveness of Variance Reduced Optimization for Deep Learning
SVR methods use control variates to reduce the variance of the traditional stochastic gradient descent (SGD) estimate f'_i(w) of the full gradient f'(w). Control variates are a classical technique for reducing the variance of a stochastic quantity without introducing bias. Say we have some random variable X.
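For readers unfamiliar with the construction this excerpt introduces, the standard control variate recipe is: given a variable Y correlated with X whose expectation E[Y] is known, the estimator X - Y + E[Y] has the same mean as X but lower variance whenever Cov(X, Y) is large. A minimal numerical sketch (standard material, not code from the paper):

```python
# Minimal numerical sketch of a control variate (standard construction, not the
# paper's code). X is the noisy quantity of interest; Y is a correlated variable
# with known mean, playing the role of the snapshot gradient in SVRG-style methods.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

Y = rng.normal(loc=2.0, scale=1.0, size=n)      # known E[Y] = 2.0
X = Y + rng.normal(loc=0.5, scale=0.3, size=n)  # correlated with Y, E[X] = 2.5

plain = X               # ordinary estimator of E[X]
cv = X - Y + 2.0        # control-variate estimator: same mean, much less variance

print(plain.mean(), plain.var())   # ~2.5, variance ~1.09
print(cv.mean(), cv.var())         # ~2.5, variance ~0.09
```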
A fast heuristic to optimize time-space tradeoff for large models
Training large-scale neural networks is heavily constrained by GPU memory. Gradient checkpointing, also known as recomputation, is a powerful technique for circumventing this limitation, and it remains an active research area with methods such as Checkmate and Moccasin. However, both Checkmate and Moccasin rely on mixed integer linear programming or constraint programming, which limits their scalability due to an exponentially large search space. This paper proposes a novel recomputation algorithm (FastSA) based on a simulated annealing heuristic that achieves comparable or even better solutions than state-of-the-art alternatives. FastSA can optimize computational graphs with thousands of nodes within 3 to 30 seconds, several orders of magnitude faster than current solutions. We applied FastSA to PyTorch models and verified its effectiveness on popular large vision and text models, including recent language models with the transformer architecture. The results demonstrate significant memory reductions of 73% with an extra 18% computational overhead on average. Our experiments demonstrate the practicality and effectiveness of our recomputation algorithm, further highlighting its potential for wide application in various deep learning domains.
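FastSA's planner itself is not spelled out in the abstract. As background, the sketch below shows the basic time-for-memory trade that such recomputation planners decide where to apply, using PyTorch's built-in activation checkpointing (illustrative only, not the FastSA algorithm):

```python
# Illustrative only: PyTorch's built-in activation checkpointing, i.e. the kind of
# recomputation that planners such as FastSA decide where to apply.
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)

blocks = torch.nn.ModuleList([Block() for _ in range(8)])
x = torch.randn(2, 128, 256, requires_grad=True)

h = x
for i, blk in enumerate(blocks):
    # Checkpoint every other block: its activations are dropped after the forward
    # pass and recomputed during backward, trading extra compute for lower peak memory.
    h = checkpoint(blk, h, use_reentrant=False) if i % 2 == 0 else blk(h)

h.sum().backward()
```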
LEANN: A Low-Storage Vector Index
Wang, Yichuan, Li, Zhifei, Liu, Shu, Wu, Yongji, Mao, Ziming, Zhao, Yilong, Yan, Xiao, Xu, Zhiying, Zhou, Yang, Stoica, Ion, Min, Sewon, Zaharia, Matei, Gonzalez, Joseph E.
Embedding-based vector search underpins many important applications, such as recommendation and retrieval-augmented generation (RAG). It relies on vector indices to enable efficient search. However, these indices require storing high-dimensional embeddings and large index metadata, whose total size can be several times larger than the original data (e.g., text chunks). Such high storage overhead makes it difficult, or even impractical, to deploy vector search on personal devices or large-scale datasets. To tackle this problem, we propose LEANN, a storage-efficient index for vector search that recomputes embeddings on the fly instead of storing them, and compresses state-of-the-art proximity graph indices while preserving search accuracy. LEANN delivers high-quality vector search while using only a fraction of the storage (e.g., 5% of the original data) and supporting storage-efficient index construction and updates. On real-world benchmarks, LEANN reduces index size by up to 50x compared with conventional indices, while maintaining SOTA accuracy and comparable latency for RAG applications.
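As a rough illustration of the idea in the abstract, the hypothetical sketch below recomputes embeddings from raw text chunks on demand during a best-first graph traversal instead of reading them from a stored index. The embed function, graph layout, and search budget are stand-ins, not LEANN's API:

```python
# Toy sketch: proximity-graph search that recomputes embeddings on the fly from the
# stored text chunks, so no embedding matrix needs to be kept on disk.
import heapq
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding model; a real system would call an encoder here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def greedy_search(graph: dict[int, list[int]], chunks: dict[int, str],
                  query: np.ndarray, entry: int, budget: int = 64, k: int = 5) -> list[int]:
    visited = {entry}
    frontier = [(-float(query @ embed(chunks[entry])), entry)]
    best = []
    while frontier and len(visited) <= budget:
        neg_sim, node = heapq.heappop(frontier)
        best.append((neg_sim, node))
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                # Embedding recomputed on demand from the raw chunk text.
                heapq.heappush(frontier, (-float(query @ embed(chunks[nb])), nb))
    return [n for _, n in sorted(best)[:k]]
```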
STAlloc: Enhancing Memory Efficiency in Large-Scale Model Training with Spatio-Temporal Planning
Huang, Zixiao, Hu, Junhao, Lin, Hao, Zhu, Chunyang, Tang, Yueran, Zhang, Quanlu, Guo, Zhen, Li, Zhenhua, Yan, Shengen, Zhu, Zhenhua, Dai, Guohao, Wang, Yu
The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Such fragmentation stems from the use of online GPU memory allocators in popular deep learning frameworks like PyTorch, which disregard tensor lifespans. As a result, this inefficiency can waste as much as 43% of memory and trigger out-of-memory errors, undermining the effectiveness of optimization methods. To address this, we introduce STAlloc, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STAlloc introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch memory allocator, STAlloc reduces fragmentation ratio on average by 85.1% (up to 100%) across both dense and MoE models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves throughput performance by up to 32.5%.
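To make the offline-planning idea concrete, here is a toy, assumption-laden sketch of lifespan-aware offset planning: tensors whose lifetimes do not overlap may share the same address range, something an online, lifetime-blind allocator cannot exploit. This is not STAlloc's actual planner, just an illustration of the principle:

```python
# Toy lifespan-aware planner: greedily place tensors (largest first) at the lowest
# offset that does not conflict with any already-placed tensor whose lifetime overlaps.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int    # bytes
    start: int   # first step the tensor is alive
    end: int     # last step the tensor is alive

def plan_offsets(tensors: list[Tensor]) -> dict[str, int]:
    placed: list[tuple[Tensor, int]] = []
    offsets: dict[str, int] = {}
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        offset = 0
        for other, off in sorted(placed, key=lambda p: p[1]):
            lifetimes_overlap = not (t.end < other.start or other.end < t.start)
            if lifetimes_overlap and offset < off + other.size and off < offset + t.size:
                offset = off + other.size   # bump past the conflicting tensor
        offsets[t.name] = offset
        placed.append((t, offset))
    return offsets

plan = plan_offsets([
    Tensor("act1", 512, start=0, end=3),
    Tensor("act2", 512, start=4, end=7),   # disjoint lifetime: reuses act1's offset
    Tensor("grad", 256, start=2, end=6),
])
print(plan)   # e.g. {'act1': 0, 'act2': 0, 'grad': 512} -> peak 768 instead of 1280
```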
CacheClip: Accelerating RAG with Effective KV Cache Reuse
Yang, Bin, Leng, Qiuyu, Zeng, Jun, Wu, Zhenhua
Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is finetuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92x in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.
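A hedged sketch of the first technique, auxiliary-model-guided token selection, is given below: it scores context tokens by the attention mass they receive in a small auxiliary model's last layer and keeps the top fraction for recomputation by the primary model. The tensor shapes and the 20% budget are illustrative assumptions, not CacheClip's implementation:

```python
# Hedged sketch: pick the context positions whose KV entries should be recomputed,
# guided by a small auxiliary model's last-layer attention over the retrieved chunks.
import torch

def select_tokens_for_recompute(aux_attn: torch.Tensor, recomp_ratio: float = 0.2) -> torch.Tensor:
    """aux_attn: [num_heads, query_len, ctx_len] last-layer attention of the auxiliary model."""
    # Aggregate the attention mass each context token receives across heads and queries.
    importance = aux_attn.sum(dim=(0, 1))            # [ctx_len]
    k = max(1, int(recomp_ratio * importance.numel()))
    selected = torch.topk(importance, k).indices      # positions to recompute in the primary model
    return torch.sort(selected).values                 # keep positional order

# Example: 8 heads, 32 query tokens, 1,000 context tokens -> 200 positions selected.
scores = torch.rand(8, 32, 1000)
positions = select_tokens_for_recompute(scores)
```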
Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, making it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
Sparse Attention across Multiple-context KV Cache
Cao, Ziyi, Si, Qingyi, Zhang, Jingbin, Liu, Bingquan
Large language models face significant cost challenges in long-sequence inference. To address this, reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput by using sparse attention mechanisms to select the most relevant KV Cache, thereby reducing sequence length. However, such techniques are limited to single-context scenarios, where historical KV Cache is computed sequentially with causal-attention dependencies. In retrieval-augmented generation (RAG) scenarios, where the retrieved documents serving as context are unknown beforehand, each document's KV Cache is computed and stored independently (termed multiple-context KV Cache), lacking cross-attention between contexts. This renders existing methods ineffective. Although prior work partially recomputes multiple-context KV Cache to mitigate accuracy loss from missing cross-attention, it requires retaining all KV Cache throughout, failing to reduce memory overhead. This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache. Specifically, SamKV takes into account the complementary information of other contexts when sparsifying one context, and then locally recomputes the sparsified information. Experiments demonstrate that our method compresses sequence length to 15% without accuracy degradation compared with full-recomputation baselines, significantly boosting throughput in multi-context RAG scenarios.
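The sketch below illustrates, under stated assumptions, the kind of per-context sparsification the abstract describes: one context's cached keys are scored against query states drawn from the other contexts (the complementary information), and roughly 15% of positions are kept and marked for local recomputation. Shapes and the scoring rule are illustrative, not SamKV's actual method:

```python
# Hedged sketch: sparsify one retrieved document's KV cache using queries from the
# other contexts, keeping only the highest-scoring positions for local recomputation.
import torch

def sparsify_context(keys: torch.Tensor, other_queries: torch.Tensor,
                     keep_ratio: float = 0.15) -> torch.Tensor:
    """keys: [ctx_len, d] cached keys of one context; other_queries: [n_q, d] query
    states from the other contexts. Returns the positions to keep (and recompute)."""
    scores = (other_queries @ keys.T).softmax(dim=-1).sum(dim=0)  # [ctx_len]
    k = max(1, int(keep_ratio * keys.shape[0]))
    keep = torch.topk(scores, k).indices
    return torch.sort(keep).values

keys = torch.randn(2048, 128)        # one document's cached keys
other_q = torch.randn(64, 128)       # complementary queries from other contexts
kept_positions = sparsify_context(keys, other_q)   # ~307 of 2048 positions retained
```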