 ssd


HiFC: High-efficiency Flash-based KV Cache Swapping for Scaling LLM Inference

Neural Information Processing Systems

Large-language-model inference with long contexts often produces key-value (KV) caches whose footprint exceeds the capacity of high-bandwidth memory on a GPU. Prior LLM inference frameworks such as vLLM mitigate this pressure by swapping KV cache pages to host DRAM. However, the high cost of large DRAM pools makes this solution economically unattractive. Although offloading to SSDs can be a cost-effective way to expand memory capacity relative to DRAM, conventional frameworks such as FlexGen experience a substantial throughput drop since the data path that routes SSD traffic through CPU to GPU is severely bandwidth-constrained. To overcome these limitations, we introduce HiFC, a novel DRAM-free swapping scheme that enables direct access to SSD-resident memory with low latency and high effective bandwidth. HiFC stores KV pages in pseudo-SLC (pSLC) regions of commodity NVMe SSDs, sustaining high throughput under sequential I/O and improving write endurance by up to 8×. Leveraging GPUDirect Storage, HiFC enables direct transfers between SSD and GPU, bypassing host DRAM and alleviating PCIe bottlenecks. HiFC employs fine-grained block mapping to confine writes to high-performance pSLC zones, stabilizing latency and throughput under load. HiFC achieves inference throughput comparable to DRAM-based swapping under diverse long-context workloads, such as NarrativeQA, while significantly lowering the memory expansion cost of a GPU server system by 4.5× over three years.


Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage

Neural Information Processing Systems

We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TERAIO, is developed explicitly for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that the active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each LLM training iteration, while the inactive tensors are usually large and will not be used for a long period of time, creating ample opportunities for offloading/prefetching tensors to/from slow SSDs without stalling the GPU training process. TERAIO accurately estimates the lifetime (active period of time in GPU memory) of each tensor by profiling the first few iterations of the training process. With the tensor lifetime analysis, TERAIO generates an optimized tensor offloading/prefetching plan and integrates it into the compiled LLM program via PyTorch. TERAIO has a runtime tensor migration engine that executes the offloading/prefetching plan via GPUDirect Storage, which allows direct tensor migration between GPUs and SSDs, alleviating the CPU bottleneck and maximizing SSD bandwidth utilization. In comparison with state-of-the-art studies such as ZeRO-Offload and ZeRO-Infinity, we show that TERAIO improves the training performance of various LLMs by 1.47× on average, and achieves 80.7% of the ideal performance assuming unlimited GPU memory.
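The lifetime analysis described in this abstract can be illustrated with a small sketch. This is not TERAIO's actual algorithm: the `Tensor` record, the `inactive_gaps` and `plan_offloads` helpers, the microsecond timestamps, and the single flat bandwidth figure are all hypothetical, chosen only to show the idea that a tensor is worth offloading when its idle gap is longer than the time needed to write it to the SSD and read it back.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tensor:
    """Hypothetical profiling record for one tensor in one training iteration."""
    name: str
    size_bytes: int
    accesses: List[int]  # access timestamps in microseconds


def inactive_gaps(t: Tensor) -> List[Tuple[int, int]]:
    """Return (start, end) intervals between consecutive accesses."""
    ts = sorted(t.accesses)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b > a]


def plan_offloads(tensors: List[Tensor], ssd_bytes_per_us: float):
    """Build (name, offload_at, prefetch_at) entries for gaps long enough
    to hide a full write-then-read round trip to the SSD."""
    plan = []
    for t in tensors:
        one_way_us = t.size_bytes / ssd_bytes_per_us
        for start, end in inactive_gaps(t):
            if end - start > 2 * one_way_us:
                # Start the prefetch early enough to finish before the next use.
                plan.append((t.name, start, end - one_way_us))
    return plan


if __name__ == "__main__":
    profiled = [
        Tensor("activation", 4_000_000, [0, 10_000]),  # long idle gap
        Tensor("weight", 4_000_000, [0, 1_500]),       # reused too soon
    ]
    # Assumed bandwidth: 4000 bytes per microsecond (about 4 GB/s).
    for entry in plan_offloads(profiled, ssd_bytes_per_us=4000.0):
        print(entry)
```

A real planner would also have to account for contention among concurrent transfers and for GPU memory pressure over time; the sketch treats each tensor independently.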


Satechi DotDisk SSD enclosure review: Svelte, fan-cooled 80Gbps storage

PCWorld

The DotDisk is by far the most portable 80Gbps enclosure I've reviewed thanks to fan cooling. Despite its diminutive size, you never need to worry about thermal throttling with the handsome Satechi DotDisk 80Gbps enclosure thanks to an internal fan. But it's pricey -- as are all 80Gbps enclosures. It's uniquely portable, as it employs active cooling in the form of a small fan rather than the bulky passive cooling fins most 80Gbps enclosures feature. Yet, despite its svelte profile, the DotDisk proved a top-flight performer.


Two SSDs are better than one in your PC. Here's why

PCWorld

PCWorld explains how using two SSDs in your PC can significantly boost performance by separating the operating system from applications and data across different drives. This setup prevents bandwidth competition during demanding tasks and offers better data protection through individual drive encryption capabilities.




Best Cyber Monday Desktop Computer Deals 2025 (and the top Black Friday offers still available)

PCWorld

From gaming PCs to mainstream all-in-ones, Cyber Monday should include solid deals for PC bargain hunters. Amazon Cyber Monday deals are still going strong through the weekend and the sales are well underway. Retailers are offering killer discounts on everything from home-office PCs to decked-out gaming rigs and sleek all-in-ones. Still, not all computer deals are built the same.


Amazon just unleashed its Cyber Monday laptop deals and it's dropping prices on MacBooks, gaming PCs, and more

Popular Science

Whether you need a basic everyday driver or a full-featured gaming PC, Amazon's Cyber Monday laptop deals can save you cash. A laptop is a big investment. Not only do they typically cost a lot of money, but you're committing to a machine you'll stare at while you shop, do homework, remote work, game, and pretty much everything else in your online life. Amazon just dropped its Cyber Monday deals on laptops and these are some of the lowest prices we have seen all year.


Stochastic Spectral and Conjugate Descent Methods

Neural Information Processing Systems

An increasing array of learning and training tasks reduce to optimization problems in very large dimensions. The state-of-the-art algorithms in this regime are based on randomized coordinate descent (RCD). Various acceleration strategies were proposed for RCD in the literature in recent years, based on techniques such as Nesterov's momentum [