offload


Efficient Combination of Rematerialization and Offloading for Training DNNs

Neural Information Processing Systems

Rematerialization and offloading are two well-known strategies for saving memory during the training phase of deep neural networks, allowing data scientists to consider larger models, larger batch sizes, or higher-resolution data.
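
Below is a minimal sketch of how the two strategies compose in PyTorch: rematerialization via torch.utils.checkpoint (activations are recomputed during the backward pass) and offloading via torch.autograd.graph.save_on_cpu (saved tensors are stashed in CPU RAM). The split of layers between the two policies is illustrative only, not the paper's schedule.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(8)]).cuda()
x = torch.randn(32, 1024, device="cuda", requires_grad=True)

h = x
for i, block in enumerate(model):
    if i < 4:
        # Rematerialize: these activations are recomputed during backward.
        h = checkpoint(block, h, use_reentrant=False)
    else:
        # Offload: tensors saved for backward live in pinned CPU memory.
        with torch.autograd.graph.save_on_cpu(pin_memory=True):
            h = block(h)

h.sum().backward()   # gradients flow through both policies transparently
```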


You Won't Be Able to Offload Your Holiday Shopping to AI Agents Anytime Soon

WIRED

Chatbot developers and retail giants are battling over user data as they lay the foundation for a future in which AI agents can do all your online shopping for you. Ask OpenAI's ChatGPT about a product on Etsy, and chances are you can enter your payment details and buy it without ever leaving the app. Instant Checkout was one of the first features to emerge from a recent wave of partnerships between leading AI and ecommerce companies. The aim is to encourage people to hand off parts of the browsing and ordering experience to AI tools and usher in an era of agentic shopping. But while these so-called agents have started to become more commonplace, they are far from taking over as full-time virtual buyers. OpenAI, Google, Amazon, and other AI chatbot developers are still negotiating with major retail partners on the best way to limit costly mistakes by agents and the amount of product data and chat history that have to be exchanged to make these agents successful, according to executives at seven tech and ecommerce companies who spoke with WIRED.


Expanding the Action Space of LLMs to Reason Beyond Language

Yue, Zhongqi, Wang, Weishi, Zhan, Yundaichuan, Li, Juncheng, Dahlmeier, Daniel, Johansson, Fredrik D.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
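
A hedged sketch of the expanded-action idea as I read the abstract: the policy's output distribution covers vocabulary tokens plus extra non-vocabulary action ids, and sampling an id beyond the vocabulary switches environments instead of emitting text. All names (VOCAB, ROUTE_CALC, etc.) are illustrative assumptions, not the paper's.

```python
import torch

VOCAB = 50_000                                   # ordinary language tokens
ROUTE_CALC, CALC_ADD, CALC_MUL, ROUTE_LANG = VOCAB, VOCAB + 1, VOCAB + 2, VOCAB + 3
N_ACTIONS = VOCAB + 4                            # vocabulary + expanded actions

def sample_action(logits: torch.Tensor, in_env: bool) -> int:
    """Mask actions that are illegal in the current environment, then sample."""
    mask = torch.full((N_ACTIONS,), float("-inf"))
    if in_env:
        # Inside the external environment, only its actions (or routing back) are legal.
        mask[[CALC_ADD, CALC_MUL, ROUTE_LANG]] = 0.0
    else:
        # In the language environment, any vocabulary token or a routing action is legal.
        mask[:VOCAB] = 0.0
        mask[ROUTE_CALC] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, 1))

logits = torch.randn(N_ACTIONS)                  # stand-in for the model's output head
action = sample_action(logits, in_env=False)
in_env = (action == ROUTE_CALC)                  # switching environments is itself an action
```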



Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences

Bekman, Stas, Rajbhandari, Samyam, Wyatt, Michael, Rasley, Jeff, Ruwase, Tunji, Yao, Zhewei, Qiao, Aurick, He, Yuxiong

arXiv.org Artificial Intelligence

Long sequences are critical for applications like RAG, long-document summarization, and multi-modality, and modern LLMs, like Llama 4 Scout, support maximum sequence lengths of up to 10 million tokens. However, outside of enterprise labs, long-sequence training is challenging for the AI community, with limited system support in the open-source space. Out of the box, even on a modern NVIDIA H100 80GB GPU cluster, training a Llama 8B model with sequences over 32K runs out of memory on a basic Hugging Face (HF) model for two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU's memory, and ii) existing solutions for leveraging multiple GPUs' memory are not easily available to HF models, making long-sequence training inaccessible. We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single-GPU and multi-GPU memory optimizations that enable out-of-the-box training at multi-million-token sequence lengths for a wide variety of HF models. ALST supports training Meta's Llama 8B model with a 500K sequence length on a single H100 GPU, 3.7M on a single 8xH100 GPU node, and over 15M on a 4-node cluster, an increase of over 400x compared to the 32K baseline for the latter. ALST is fully compatible with HF models and open-sourced via DeepSpeed https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-pallellism/ and Arctic Training https://github.com/snowflakedb/ArcticTraining/blob/main/projects/sequence-parallelism/README.md.
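
One single-GPU ingredient that ALST-style systems rely on is never materializing the full [sequence, vocabulary] logits tensor at once. The sketch below illustrates that memory principle with per-tile checkpointing, so at most one tile's logits are alive at any time; it is my hedged reading of the idea, not ALST's implementation, which additionally uses Ulysses-style sequence parallelism across GPUs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

d_model, vocab, seq, tile = 1024, 32_000, 16_384, 2_048
lm_head = nn.Linear(d_model, vocab)
hidden = torch.randn(seq, d_model, requires_grad=True)   # last hidden states
labels = torch.randint(0, vocab, (seq,))

def tile_loss(h, y):
    # Logits exist for one tile only; checkpointing recomputes them in the
    # backward pass, so at most one [tile, vocab] block is alive at any time.
    return F.cross_entropy(lm_head(h), y, reduction="sum")

total = hidden.new_zeros(())
for s in range(0, seq, tile):
    total = total + checkpoint(tile_loss, hidden[s:s + tile], labels[s:s + tile],
                               use_reentrant=False)
loss = total / seq
loss.backward()   # the full [seq, vocab] logits tensor is never materialized
```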


SplitReason: Learning To Offload Reasoning

Akhauri, Yash, Fei, Anthony, Chang, Chi-Chih, AbouElhamayed, Ahmed F., Li, Yueying, Abdelfattah, Mohamed S.

arXiv.org Artificial Intelligence

Reasoning in large language models (LLMs) tends to produce substantially longer token-generation sequences than simpler language modeling tasks. This extended generation length reflects the multi-step, compositional nature of reasoning and is often correlated with higher solution accuracy. From an efficiency perspective, longer token generation exacerbates the inherently sequential and memory-bound decoding phase of LLMs. However, not all parts of this expensive reasoning process are equally difficult to generate. We leverage this observation by offloading only the most challenging parts of the reasoning process to a larger, more capable model, while performing most of the generation with a smaller, more efficient model; furthermore, we teach the smaller model to identify these difficult segments and independently trigger offloading when needed. To enable this behavior, we annotate difficult segments across 18k reasoning traces from the OpenR1-Math-220k chain-of-thought (CoT) dataset. We then apply supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to a 1.5B-parameter reasoning model, training it to learn to offload the most challenging parts of its own reasoning process to a larger model. This approach improves AIME24 reasoning accuracy by 24% and 28.3% while offloading 1.35% and 5% of the generated tokens, respectively. We open-source our SplitReason model, data, code, and logs.
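
A minimal sketch of the decode-time control flow the abstract describes: the small model generates until it emits a control marker flagging a hard segment, the larger model completes that segment, and the small model resumes. The marker strings and the generate(text, stop=...) interface are hypothetical assumptions for illustration, not the paper's exact tokens or serving stack.

```python
OFFLOAD_START, OFFLOAD_END = "<offload>", "</offload>"

def split_generate(small_model, big_model, prompt: str, max_rounds: int = 8) -> str:
    """Small model decodes cheaply; the big model handles flagged segments.

    Assumes (hypothetically) that each model exposes generate(text, stop=[...])
    returning the newly generated text, including the stop marker when one is hit.
    """
    text = prompt
    for _ in range(max_rounds):
        chunk = small_model.generate(text, stop=[OFFLOAD_START])
        text += chunk
        if not chunk.endswith(OFFLOAD_START):
            break                                  # small model finished on its own
        # The big model completes only the difficult segment, then hands back.
        text += big_model.generate(text, stop=[OFFLOAD_END])
    return text
```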


PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Wan, Xinyi, Qi, Penghui, Huang, Guangxing, Li, Jialin, Lin, Min

arXiv.org Artificial Intelligence

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight microbatches grows with the degree of PP. In this paper, we address this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments show that per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative to tensor parallelism (TP), offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
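
A hedged sketch of the underlying mechanism: after a stage's forward pass, activations are copied to pinned CPU memory on a side CUDA stream so the transfer overlaps compute, and they are prefetched back before the corresponding backward. The scheduling decisions (which microbatches and stages to offload) are the paper's contribution and are not modeled here.

```python
import torch

copy_stream = torch.cuda.Stream()

def offload(act: torch.Tensor) -> torch.Tensor:
    """Copy a stage's activation to pinned CPU memory on a side stream."""
    copy_stream.wait_stream(torch.cuda.current_stream())   # act must be ready
    cpu = torch.empty(act.shape, dtype=act.dtype, device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        cpu.copy_(act, non_blocking=True)                  # D2H overlaps compute
        act.record_stream(copy_stream)                     # keep act alive until copied
    return cpu

def reload(cpu: torch.Tensor) -> torch.Tensor:
    """Bring the activation back to the GPU just before its backward pass."""
    with torch.cuda.stream(copy_stream):
        gpu = cpu.to("cuda", non_blocking=True)            # H2D ahead of backward
    torch.cuda.current_stream().wait_stream(copy_stream)   # backward waits on copy
    return gpu
```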


HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

Luo, Cheng, Cai, Zefan, Sun, Hanshi, Xiao, Jinqi, Yuan, Bo, Xiao, Wen, Hu, Junjie, Zhao, Jiawei, Chen, Beidi, Anandkumar, Anima

arXiv.org Artificial Intelligence

Transformer-based large language models (LLMs) demonstrate impressive performance in long-context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise offloading strategy, keeping only selected attention heads' KV cache on the GPU while computing attention output dynamically. Through roofline analysis, we demonstrate that HEADINFER maintains computational efficiency while significantly reducing the memory footprint. We evaluate HEADINFER on the Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory footprint of the KV cache from 128 GB to 1 GB and the total GPU memory usage from 207 GB to 17 GB, a 92% reduction compared to BF16 baseline inference. Notably, HEADINFER enables 4-million-token inference with an 8B model on a single consumer GPU with 24 GB of memory (e.g., an NVIDIA RTX 4090) without approximation methods.
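
A minimal sketch of the head-wise partitioning described in the abstract: the KV cache is split along the head dimension, a few heads stay resident on the GPU, and the remaining heads live in pinned CPU RAM and are streamed in per attention call. Head counts, shapes, and the residency policy here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

n_heads, gpu_heads, seq, d_head = 8, 2, 4096, 128
# KV cache partitioned along the head dimension: [heads, seq, d_head].
k_gpu = torch.randn(gpu_heads, seq, d_head, device="cuda")
v_gpu = torch.randn(gpu_heads, seq, d_head, device="cuda")
k_cpu = torch.randn(n_heads - gpu_heads, seq, d_head, pin_memory=True)
v_cpu = torch.randn(n_heads - gpu_heads, seq, d_head, pin_memory=True)

def attend(q: torch.Tensor) -> torch.Tensor:
    """One decode step; q: [n_heads, 1, d_head]."""
    outs = [F.scaled_dot_product_attention(q[:gpu_heads], k_gpu, v_gpu)]
    # Stream the offloaded heads through the GPU one at a time, so only a
    # single extra head's KV is ever resident beyond the GPU-pinned heads.
    for h in range(n_heads - gpu_heads):
        k = k_cpu[h].to("cuda", non_blocking=True).unsqueeze(0)
        v = v_cpu[h].to("cuda", non_blocking=True).unsqueeze(0)
        q_h = q[gpu_heads + h : gpu_heads + h + 1]
        outs.append(F.scaled_dot_product_attention(q_h, k, v))
    return torch.cat(outs, dim=0)                 # [n_heads, 1, d_head]

out = attend(torch.randn(n_heads, 1, d_head, device="cuda"))
```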


ColA: Collaborative Adaptation with Gradient Learning

Diao, Enmao, Le, Qi, Wu, Suya, Wang, Xinran, Anwar, Ali, Ding, Jie, Tarokh, Vahid

arXiv.org Artificial Intelligence

A primary function of back-propagation is to compute the gradients of both hidden representations and parameters for optimization with gradient descent. Training large models incurs high computational costs due to their vast parameter sizes. While Parameter-Efficient Fine-Tuning (PEFT) methods aim to train smaller auxiliary models to save computational resources, they still present computational overheads, especially in Fine-Tuning as a Service (FTaaS) for numerous users. We introduce Collaborative Adaptation (ColA) with Gradient Learning (GL), a parameter-free, model-agnostic fine-tuning approach that decouples the computation of the gradients of hidden representations from that of parameters. Compared to PEFT methods, ColA facilitates more cost-effective FTaaS by offloading the computation of parameter gradients to low-cost devices. We also provide a theoretical analysis of ColA and experimentally demonstrate that it can perform on par with or better than existing PEFT methods on various benchmarks.
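
A hedged sketch of the decoupling the abstract describes, using a plain linear adapter as a stand-in: the serving side backpropagates only to the adapter's output and ships the (input, output-gradient) pair, from which a low-cost device computes the parameter gradient on its own. The manual gradient formula applies to this toy adapter only.

```python
import torch

# --- serving side: forward pass; backprop stops at the adapter boundary ---
x = torch.randn(32, 512)          # adapter input (captured, e.g., by a hook)
W = torch.zeros(512, 512)         # adapter weight, owned by the low-cost device
out = x @ W
out.requires_grad_(True)          # treat the adapter output as the graph's leaf
loss = out.sum()                  # stand-in for the rest of the model and loss
grad_out, = torch.autograd.grad(loss, out)   # the only gradient the server computes

# --- low-cost device: parameter gradient from the shipped (x, grad_out) pair ---
grad_W = x.t() @ grad_out         # d(loss)/dW for out = x @ W
W -= 1e-3 * grad_W                # local gradient-descent update on the adapter
```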


Emergent Communication Protocol Learning for Task Offloading in Industrial Internet of Things

Mostafa, Salwa, Mota, Mateus P., Valcarce, Alvaro, Bennis, Mehdi

arXiv.org Artificial Intelligence

In this paper, we leverage a multi-agent reinforcement learning (MARL) framework to jointly learn a computation offloading decision and a multichannel access policy with corresponding signaling. Specifically, the base station and industrial Internet of Things mobile devices are reinforcement learning agents that must cooperate to execute their computation tasks within a deadline constraint. We adopt an emergent communication protocol learning framework to solve this problem. Numerical results illustrate the effectiveness of emergent communication in improving the channel-access success rate and increasing the number of successfully computed tasks compared with contention-based, contention-free, and no-communication approaches. Moreover, the proposed task offloading policy outperforms remote- and local-computation baselines.
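
As a toy illustration of the offloading decision itself (not the paper's multi-agent or emergent-communication machinery), the sketch below trains a tabular bandit-style agent to choose between local computation and offloading under a deadline; the load model, latencies, and rewards are invented for the example.

```python
import random

ACTIONS = ("local", "offload")
Q = {(load, a): 0.0 for load in range(4) for a in ACTIONS}
alpha, eps, deadline = 0.1, 0.1, 3.0

def latency(load: int, action: str) -> float:
    # Invented model: local time grows with device load; offloading pays a
    # fixed transmission cost plus a random channel/server delay.
    return 2.0 + load if action == "local" else 1.0 + 2.0 * random.random()

for _ in range(5_000):
    load = random.randrange(4)                    # observed device load
    a = random.choice(ACTIONS) if random.random() < eps else \
        max(ACTIONS, key=lambda b: Q[(load, b)])
    r = 1.0 if latency(load, a) <= deadline else -1.0
    Q[(load, a)] += alpha * (r - Q[(load, a)])    # bandit-style Q update

policy = {load: max(ACTIONS, key=lambda a: Q[(load, a)]) for load in range(4)}
print(policy)   # high-load states should learn to offload
```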