 Memory


AMD Radeon RX 9070 and 9070 XT review: The new 1440p gaming champions

PCWorld

Some software bugs mar the experience, but overall, AMD's 9070 graphics cards offer such a compelling mix of performance, value, and memory capacity that it's worth accepting those quibbles. Nvidia fumbled the ball with its $549 GeForce RTX 5070, and AMD's new Radeon RX 9070 and 9070 XT are primed to seize the advantage. The RTX 5070, hitting store shelves today, is a good 1440p graphics card but a stagnant generational sidegrade at best. Enter the $549 Radeon RX 9070 and $599 Radeon RX 9070 XT, launching tomorrow. Both cards are faster than the RTX 5070, with the 9070 XT going toe-to-toe with the $750 RTX 5070 Ti in many games, and each includes an ample 16GB of VRAM.


Linear-Memory and Decomposition-Invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes

Neural Information Processing Systems

Recently, several works have shown that natural modifications of the classical conditional gradient method (aka the Frank-Wolfe algorithm) for constrained convex optimization provably converge with a linear rate when the feasible set is a polytope and the objective is smooth and strongly convex. However, all of these results suffer from two significant shortcomings: i) a large memory requirement due to the need to store an explicit convex decomposition of the current iterate, and, as a consequence, a large running-time overhead per iteration; and ii) a worst-case convergence rate that depends unfavorably on the dimension. In this work we present a new conditional gradient variant and a corresponding analysis that improves on both of the above shortcomings. In particular, both memory and computation overheads are only linear in the dimension, and in addition, in case the optimal solution is sparse, the new convergence rate replaces a factor which is at least linear in the dimension in previous works with a linear dependence on the number of non-zeros in the optimal solution. At the heart of our method, and the corresponding analysis, is a novel way to compute decomposition-invariant away-steps. While our theoretical guarantees do not apply to any polytope, they apply to several important structured polytopes that capture central concepts such as paths in graphs, perfect matchings in bipartite graphs, marginal distributions that arise in structured prediction tasks, and more. Our theoretical findings are complemented by empirical evidence that shows that our method delivers state-of-the-art performance.
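
For orientation, here is a minimal sketch of the classical conditional gradient (Frank-Wolfe) step that the abstract builds on, written for the probability simplex. It is not the paper's decomposition-invariant variant; the objective, step-size rule, and function names are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, num_iters=100):
    """Classical conditional gradient (Frank-Wolfe) over the probability simplex.

    grad_f: callable returning the gradient of the smooth objective at x.
    x0: feasible starting point (non-negative entries summing to 1).
    """
    x = x0.copy()
    for t in range(num_iters):
        g = grad_f(x)
        # Linear minimization oracle over the simplex: the minimizer of <g, s>
        # is the vertex (standard basis vector) at the smallest gradient entry.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        # Standard 2/(t+2) step size; note that no explicit convex decomposition
        # of the iterate is stored here.
        gamma = 2.0 / (t + 2.0)
        x = (1.0 - gamma) * x + gamma * s
    return x

# Illustrative use: minimize ||Ax - b||^2 over the simplex.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad = lambda x: 2.0 * A.T @ (A @ x - b)
x_star = frank_wolfe_simplex(grad, np.full(5, 0.2))
```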


on a memory economical calculation, while its vanilla multi-key counterpart is less memory efficient when achieving

Neural Information Processing Systems

Thank you for acknowledging the key contributions of our paper. R1.2 Generalize to video: As suggested, we conducted additional The top-1 accuracy of JCL pre-trained features is 48.6%, which outperforms MoCo v2 (47.3%). Generalization of JCL to other data modalities (sound, language, video) will be included in our future work. Regarding your concerns about the writing quality and typos (e.g., Algorithm 1 The top-1 accuracy on ImageNet100 for vanilla (ResNet-50) is 80.9%, while JCL achieves 82.0%. R2.3 SimCLR: The top-5 accuracy we reported (87.3%) for SimCLR was extracted from the Thus, there is no one-to-one correspondence between the data in Table 1 and Figure 2.


Reviews: Large Memory Layers with Product Keys

Neural Information Processing Systems

UPDATE: The authors answered my questions; I would like to keep my score unchanged and suggest focusing on the clarity of the final version. Perhaps this is a case where I would really be interested in looking at the source code. Originality: the paper borrows the general idea of product keys from the database community; however, the application to fast retrieval in neural memory systems seems quite novel to me. Quality: The core ideas of the paper are sound; however, I would appreciate more rigor in both the conceptual and experimental comparison with other approaches incorporating memory into the Transformer (see e.g. Another suggestion would be to discuss in more depth the issue of potential non-uniformity of the query distribution, which indeed seems quite relevant.
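
For context, here is a rough sketch of how product-key retrieval can work, as I read the idea referenced in this review: the full key set is the Cartesian product of two small sub-key codebooks, so a top-k lookup over the product only needs two small top-k searches plus a merge over the candidate pairs. The array shapes, scoring function, and variable names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def product_key_topk(query, subkeys1, subkeys2, k=4):
    """Sketch of product-key retrieval: the implicit key table has
    n1 * n2 entries, but a top-k search only touches n1 + n2 sub-keys
    plus a k*k candidate merge."""
    d = query.shape[0] // 2
    q1, q2 = query[:d], query[d:]

    # Dot-product scores against each sub-key codebook.
    s1 = subkeys1 @ q1          # shape (n1,)
    s2 = subkeys2 @ q2          # shape (n2,)

    # Top-k candidates per half.
    i1 = np.argsort(-s1)[:k]
    i2 = np.argsort(-s2)[:k]

    # Score of a product key (i, j) is s1[i] + s2[j]; rank the k*k candidates.
    cand_scores = s1[i1][:, None] + s2[i2][None, :]
    flat = np.argsort(-cand_scores, axis=None)[:k]
    rows, cols = np.unravel_index(flat, cand_scores.shape)

    # Flat memory-slot indices into the implicit n1 * n2 key table.
    slots = i1[rows] * subkeys2.shape[0] + i2[cols]
    return slots, cand_scores[rows, cols]

rng = np.random.default_rng(0)
q = rng.normal(size=16)
slots, scores = product_key_topk(q, rng.normal(size=(128, 8)), rng.normal(size=(128, 8)))
```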


corresponding modifications in the revised paper. 32GB of RAM, it takes 65 seconds to estimate the O(|V |

Neural Information Processing Systems

We thank the reviewers for their valuable feedback. R2 and R3 had questions about the time complexity of our method. As noted in Appendix A, this computation can be amortized across many goal-reaching tasks. Lastly, we agree with R2 that the construction of "good" replay buffers is an We will clarify this in Section 2.3. We will clarify this in Alg. 1.


Managed-Retention Memory: A New Class of Memory for the AI Era

arXiv.org Artificial Intelligence

AI clusters today are one of the major uses of High Bandwidth Memory (HBM). However, HBM is suboptimal for AI workloads for several reasons. Analysis shows that HBM is overprovisioned on write performance but underprovisioned on density and read bandwidth, and it also has significant energy-per-bit overheads. It is also expensive, with lower yield than DRAM due to manufacturing complexity. We propose a new memory class: Managed-Retention Memory (MRM), which is better optimized to store key data structures for AI inference workloads. We believe that MRM may finally provide a path to viability for technologies that were originally proposed to support Storage Class Memory (SCM). These technologies traditionally offered long-term persistence (10+ years) but provided poor IO performance and/or endurance. MRM makes different trade-offs, and by understanding the workload IO patterns, MRM forgoes long-term data retention and write performance for better potential performance on the metrics important for these workloads.


Best of CES 2025: The PC and home tech that blew us away

PCWorld

You never know what you're going to get with CES. Of course, we knew we'd hear a lot about AI -- check -- and that there'd be announcements of new CPUs and GPUs -- also check. But you just never know how all the pomp and hoo-ha of this annual mega tech event is going to pay off in the real world, for regular consumers. Does the average PC user have something to be excited about now that the veil has come off of this year's product launches? If the PCWorld staff is any indication, the answer is yes!


Input-Based Ensemble-Learning Method for Dynamic Memory Configuration of Serverless Computing Functions

arXiv.org Artificial Intelligence

In today's Function-as-a-Service offerings, a programmer is usually responsible for configuring function memory for its successful execution, which allocates proportional function resources such as CPU and network. However, right-sizing the function memory forces developers to speculate about performance and make ad-hoc configuration decisions. Recent research has highlighted that a function's input characteristics, such as input size, type and number of inputs, significantly impact its resource demand, run-time performance and costs under fluctuating workloads. This correlation further makes memory configuration a non-trivial task. On that account, an input-aware function memory allocator not only improves developer productivity by completely hiding resource-related decisions but also drives an opportunity to reduce resource wastage and offer a finer-grained, cost-optimised pricing scheme. Therefore, we present MemFigLess, a serverless solution that estimates the memory requirement of a serverless function with input-awareness. The framework executes function profiling in an offline stage and trains a multi-output Random Forest Regression model on the collected metrics to invoke input-aware optimal configurations. We evaluate our work against state-of-the-art approaches on the AWS Lambda service and find that MemFigLess is able to capture the input-aware resource relationships, allocate up to 82% fewer resources and save up to 87% in run-time costs.
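
To make the modeling step concrete, here is a minimal sketch of an input-aware, multi-output Random Forest regressor of the kind the abstract describes, using scikit-learn (which supports multi-output targets natively). The feature schema, synthetic profiling data, candidate memory configurations, and headroom policy are illustrative assumptions rather than MemFigLess's actual design.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative offline profiling data (assumed schema):
# features = [input_size_mb, num_inputs, input_type_id, configured_memory_mb]
# targets  = [execution_time_ms, peak_memory_mb]
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(1, 500, 1000),                        # input size (MB)
    rng.integers(1, 50, 1000),                        # number of inputs
    rng.integers(0, 3, 1000),                         # encoded input type
    rng.choice([128, 256, 512, 1024, 2048], 1000),    # memory config (MB)
])
y = np.column_stack([
    50 + 2.0 * X[:, 0] - 0.02 * X[:, 3] + rng.normal(0, 5, 1000),  # runtime (ms)
    30 + 0.8 * X[:, 0] + rng.normal(0, 3, 1000),                   # peak memory (MB)
])

# Multi-output regression: one forest jointly predicts runtime and peak memory.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def cheapest_config(input_size_mb, num_inputs, input_type_id,
                    configs=(128, 256, 512, 1024, 2048)):
    """Pick the smallest memory config whose predicted peak memory fits,
    mirroring the 'input-aware optimal configuration' idea in the abstract."""
    for mem in configs:
        runtime_ms, peak_mb = model.predict(
            [[input_size_mb, num_inputs, input_type_id, mem]])[0]
        if peak_mb <= 0.9 * mem:      # keep 10% headroom (assumed policy)
            return mem, runtime_ms
    return configs[-1], runtime_ms

print(cheapest_config(200, 5, 1))
```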


Apple MacBook Pro M4 review: faster, better and cheaper

The Guardian

Apple's upgraded MacBook Pro for 2024 gets a significant power boost with the M4 chip, double the memory as standard, even longer battery life and a price cut, ending the year on a high. The Guardian's journalism is independent. We will earn a commission if you buy something through an affiliate link. The longstanding laptop line now starts at £1,599 (€1,899/$1,599/A$2,499), making it £100 or so cheaper than last year's M3 models. Though still an expensive, premium laptop, it comes with at least 16GB of RAM rather than 8GB, which was an upgrade worth paying extra for on previous models. The outside hasn't changed from its predecessor.


vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

arXiv.org Artificial Intelligence

Efficient management of GPU memory is essential for high-throughput LLM inference. Prior systems reserved KV-cache memory ahead of time, which resulted in wasted capacity due to internal fragmentation. Inspired by demand paging, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation and improves serving throughput. However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous virtual memory to non-contiguous virtual memory. As a consequence, one needs to rewrite the attention kernels to support paging, and implement a memory manager in the serving framework. This results in both performance and programming overheads, as well as portability challenges in adopting state-of-the-art attention kernels. In this paper, we propose vAttention, a new approach for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention stores the KV-cache in contiguous virtual memory and leverages OS support for on-demand allocation of physical memory. vAttention thus enables one to use state-of-the-art attention kernels out of the box by adding support for dynamic allocation of physical memory, without having to rewrite their code. We implement vAttention in the vLLM serving stack to show that it also helps improve decode throughput by up to 1.99x over vLLM, and end-to-end serving throughput by up to 1.22x and 1.29x, compared to using the state-of-the-art PagedAttention-based kernels of FlashAttention and FlashInfer.
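
As a rough conceptual illustration of the contiguous-virtual-memory idea described above (not the paper's GPU implementation, which relies on lower-level memory-management support inside the serving stack), the sketch below simulates a KV-cache that reserves a fixed contiguous virtual range up front and attaches fixed-size physical pages only as tokens are generated. The page size, class name, and byte accounting are assumptions for illustration.

```python
PAGE_BYTES = 2 * 1024 * 1024  # assumed physical-page granularity (2 MiB)

class GrowableKVCache:
    def __init__(self, max_bytes):
        # Reserve the full contiguous virtual range up front (addresses only,
        # no physical backing yet), sized for the maximum sequence length.
        self.max_bytes = max_bytes
        self.mapped_pages = 0   # physical pages attached so far
        self.used_bytes = 0

    def append_tokens(self, token_bytes):
        """Grow the cache for newly generated tokens, mapping physical pages
        lazily so unused reserved capacity costs no physical memory."""
        needed = self.used_bytes + token_bytes
        if needed > self.max_bytes:
            raise MemoryError("exceeds reserved virtual range")
        while self.mapped_pages * PAGE_BYTES < needed:
            self._map_physical_page()
        self.used_bytes = needed

    def _map_physical_page(self):
        # In a real system this would back the next virtual page with physical
        # memory; the attention kernel still sees one contiguous buffer, so no
        # paging-aware kernel changes are required.
        self.mapped_pages += 1

cache = GrowableKVCache(max_bytes=64 * 1024 * 1024)
cache.append_tokens(300_000)   # decode step: only the pages needed get mapped
print(cache.mapped_pages, "physical page(s) mapped for", cache.used_bytes, "bytes")
```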