vram


DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

Feldmann, Casimir, Wilder-Smith, Maximum, Patil, Vaishakh, Oechsle, Michael, Niemeyer, Michael, Tateno, Keisuke, Hutter, Marco

arXiv.org Artificial Intelligence

Abstract--Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.

Recent advances in neural representations for 3D scene reconstruction have revolutionized novel view synthesis, with 3D Gaussian Splatting (3DGS) [1] emerging as an exceptionally efficient and high-quality approach. Unlike volume-based methods [2]-[4] that struggle with rendering speed due to expensive ray marching, 3DGS provides real-time rendering capabilities while maintaining impressive visual fidelity.
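The out-of-core idea in the abstract -- active chunks in GPU memory, inactive chunks spilled to disk -- can be sketched as an LRU chunk cache. This is a minimal illustration of the general pattern, not DiskChunGS's actual implementation; the `ChunkCache` class, its capacity parameter, and pickle-based spilling are all assumptions for the sketch.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class ChunkCache:
    """Illustrative out-of-core chunk manager: keeps at most `capacity`
    spatial chunks "resident" (here: a plain dict standing in for GPU
    memory) and spills the least recently used chunk to disk,
    reloading it transparently on the next access."""

    def __init__(self, capacity, spill_dir=None):
        self.capacity = capacity
        self.active = OrderedDict()  # chunk_id -> Gaussian parameters
        self.spill_dir = spill_dir or tempfile.mkdtemp()

    def _path(self, chunk_id):
        return os.path.join(self.spill_dir, f"chunk_{chunk_id}.pkl")

    def get(self, chunk_id):
        if chunk_id in self.active:           # hit: mark most recently used
            self.active.move_to_end(chunk_id)
            return self.active[chunk_id]
        with open(self._path(chunk_id), "rb") as f:  # miss: load from disk
            data = pickle.load(f)
        self.put(chunk_id, data)
        return data

    def put(self, chunk_id, data):
        self.active[chunk_id] = data
        self.active.move_to_end(chunk_id)
        while len(self.active) > self.capacity:      # evict LRU chunk to disk
            old_id, old_data = self.active.popitem(last=False)
            with open(self._path(old_id), "wb") as f:
                pickle.dump(old_data, f)
```

In a real system the eviction policy would follow the camera trajectory (spatial proximity) rather than pure recency, and the payload would be GPU tensors rather than pickled dicts.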


A LoD of Gaussians: Unified Training and Rendering for Ultra-Large Scale Reconstruction with External Memory

Windisch, Felix, Köhler, Thomas, Radl, Lukas, Steiner, Michael, Schmalstieg, Dieter, Steinberger, Markus

arXiv.org Artificial Intelligence

Gaussian Splatting has emerged as a high-performance technique for novel view synthesis, enabling real-time rendering and high-quality reconstruction of small scenes. However, scaling to larger environments has so far relied on partitioning the scene into chunks -- a strategy that introduces artifacts at chunk boundaries, complicates training across varying scales, and is poorly suited to unstructured scenarios such as city-scale flyovers combined with street-level views. Moreover, rendering remains fundamentally limited by GPU memory, as all visible chunks must reside in VRAM simultaneously. We introduce A LoD of Gaussians, a framework for training and rendering ultra-large-scale Gaussian scenes on a single consumer-grade GPU -- without partitioning. Our method stores the full scene out-of-core (e.g., in CPU memory) and trains a Level-of-Detail (LoD) representation directly, dynamically streaming only the relevant Gaussians. A hybrid data structure combining Gaussian hierarchies with Sequential Point Trees enables efficient, view-dependent LoD selection, while a lightweight caching and view scheduling system exploits temporal coherence to support real-time streaming and rendering. Together, these innovations enable seamless multi-scale reconstruction and interactive visualization of complex scenes -- from broad aerial views to fine-grained ground-level details.
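The view-dependent LoD selection described above can be illustrated with a toy hierarchy traversal: descend into finer Gaussians only while a node's projected screen-space extent exceeds a pixel budget. This is a sketch of the general refine-or-stop criterion, not the paper's hybrid hierarchy/Sequential-Point-Tree structure; the node dictionary layout and pinhole projection are assumptions.

```python
import math

def select_lod(nodes, cam_pos, pixel_threshold, focal):
    """Toy view-dependent LoD selection: recurse into a Gaussian
    hierarchy only while a node's projected extent exceeds the pixel
    threshold; otherwise emit the coarse node as-is."""
    selected = []
    stack = list(nodes)                      # roots of the hierarchy
    while stack:
        node = stack.pop()
        dx = [a - b for a, b in zip(node["center"], cam_pos)]
        dist = max(math.sqrt(sum(d * d for d in dx)), 1e-6)
        projected = focal * node["extent"] / dist  # pinhole size on screen
        if projected > pixel_threshold and node.get("children"):
            stack.extend(node["children"])   # refine: descend to finer level
        else:
            selected.append(node["id"])      # coarse enough: render this node
    return selected
```

Only the selected nodes need to be streamed into VRAM for the current view, which is what keeps the resident working set small for both aerial and street-level cameras.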


Prima.cpp: Fast 30-70B LLM Inference on Heterogeneous and Low-Resource Home Clusters

Li, Zonghang, Li, Tao, Feng, Wenjiao, Xiao, Rongxing, She, Jianshu, Huang, Hong, Guizani, Mohsen, Yu, Hongfang, Ho, Qirong, Xiang, Wei, Liu, Steve

arXiv.org Artificial Intelligence

On-device inference offers privacy, offline use, and instant response, but consumer hardware restricts large language models (LLMs) to low throughput and capability. To overcome this challenge, we present prima.cpp, a distributed on-device inference system that runs 30-70B LLMs on consumer home clusters with mixed CPUs/GPUs, insufficient RAM/VRAM, slow disks, Wi-Fi links, and heterogeneous OSs. We introduce pipelined-ring parallelism (PRP) to overlap disk I/O with compute and communication, and address the prefetch-release conflict in mmap-based offloading. We further propose Halda, a heterogeneity-aware scheduler that co-optimizes per-device CPU/GPU workloads and device selection under RAM/VRAM constraints. On four consumer home devices, a 70B model reaches 674 ms/token TPOT with <6% memory pressure, and a 32B model with speculative decoding achieves 26 tokens/s. Compared with llama.cpp, exo, and dllama, our proposed prima.cpp achieves 5-17x lower TPOT, supports fine-grained model sizes from 8B to 70B, ensures broader cross-OS and quantization compatibility, and remains OOM-free, while also being Wi-Fi tolerant, privacy-preserving, and hardware-independent. The code is available at https://gitee.com/zonghang-li/prima.cpp.
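The core trick of overlapping disk I/O with compute can be sketched with a prefetch thread: while the main thread runs layer i, a loader thread is already reading layer i+1 from disk. This is only an illustration of the overlap principle under assumed `load_layer`/`compute_layer` callables, not prima.cpp's actual pipelined-ring parallelism or its mmap handling.

```python
import threading
import queue

def pipelined_layers(load_layer, compute_layer, n_layers, x):
    """Toy pipelined execution: a loader thread prefetches upcoming
    layer weights while the main thread computes the current layer,
    hiding disk-load latency behind computation."""
    prefetched = queue.Queue(maxsize=2)      # small prefetch window

    def loader():
        for i in range(n_layers):
            prefetched.put((i, load_layer(i)))  # may block on slow disk

    threading.Thread(target=loader, daemon=True).start()
    for _ in range(n_layers):
        i, weights = prefetched.get()        # next layer is (often) ready
        x = compute_layer(weights, x)        # loader keeps running meanwhile
    return x
```

The bounded queue caps memory pressure: at most two layers' weights are in flight beyond the one being computed.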


MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

Chen, Jiarui, Chen, Yikeng, Zou, Yingshuang, Huang, Ye, Wang, Peng, Liu, Yuan, Sun, Yujing, Wang, Wenping

arXiv.org Artificial Intelligence

3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS$^{2}$, a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight, arbitrarily oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality. Project page: https://megs-2.github.io/
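Replacing spherical harmonics with spherical Gaussian lobes can be illustrated with the standard SG parameterization (amplitude a, axis mu, sharpness lambda): c(d) = sum_k a_k * exp(lambda_k * (dot(d, mu_k) - 1)). This is an illustrative stand-in for the paper's arbitrarily oriented lobes, not its exact formulation; lobe count and parameter layout here are assumptions.

```python
import math

def sg_color(lobes, view_dir):
    """Evaluate a view-dependent color from spherical Gaussian lobes:
        c(d) = sum_k a_k * exp(lam_k * (dot(d, mu_k) - 1))
    Each lobe is (rgb_amplitude, axis, sharpness); the weight peaks
    at 1.0 when the view direction aligns with the lobe axis."""
    color = [0.0, 0.0, 0.0]
    for a, mu, lam in lobes:
        cos = sum(d * m for d, m in zip(view_dir, mu))
        w = math.exp(lam * (cos - 1.0))      # in (0, 1], max when d == mu
        for c in range(3):
            color[c] += a[c] * w
    return color
```

The memory saving comes from the parameter count: a few SG lobes need far fewer coefficients per primitive than a degree-3 SH expansion (48 floats for RGB).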


Intel's new configurable VRAM option gives Core laptops an AI boost

PCWorld

For many months, AMD offered a special treat to enthusiasts wishing to run AI chatbot LLMs on their PCs: configurable VRAM that significantly improved performance. Now Intel can say the same. Bob Duffy, who oversees Intel's AI Playground application for running AI art and local chatbots on your PC, tweeted that the company's latest Arc driver for its integrated GPUs now offers a "shared GPU memory override" that provides the ability to adjust your PC's VRAM, provided that you have a supported processor. This is a big deal for AI and even some games, though not an obvious one. If you owned an Intel Core laptop with 32GB of memory, for example, 16GB of it could be assigned to AI and games.


Framework Desktop review: A powerful AI PC, made with love

PCWorld

The Framework Desktop DIY Edition is a thoughtfully engineered small-form-factor desktop PC that is both an entry point into enthusiast computing and a powerful AI desktop in its own right. The Framework Desktop DIY Edition is unique: a do-it-yourself desktop without the complexity of building from scratch, forming a compact, personalized "AI workstation." If you're nervous about a less-familiar brand, don't be. Multiple photos show how to tighten a thumbscrew -- that's how comfortable they want you to be. I can point to a few things that I thought needed improvement: soldered memory, a beta driver bundle that should be finalized by the time you buy it, and a top panel which didn't clip in as easily as I would have liked. Inserting the SSD stressed me out a bit, too. But Framework's eye for customization (colored tiles you can design and install yourself, plus your choice of I/O) lends itself to fun and productivity. The AMD Ryzen AI Max (Strix Halo) chip inside is slightly out of the ordinary, with its do-everything design. I have high praise for the Framework Desktop, and think you will too.


Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization

Chaubard, Francois, Kochenderfer, Mykel

arXiv.org Artificial Intelligence

During inference, Recurrent Neural Networks (RNNs) scale at constant cost in both FLOPs and GPU memory with increasing context length, as they compress all prior tokens into a fixed-size memory. In contrast, transformers scale linearly in FLOPs and, at best, linearly in memory during generation, since they must attend to all previous tokens explicitly. Despite this inference-time advantage, training large RNNs on long contexts remains impractical because standard optimization methods depend on Backpropagation Through Time (BPTT). BPTT requires retention of all intermediate activations during the forward pass, causing memory usage to scale linearly with both context length and model size. In this paper, we show that Zero-Order Optimization (ZOO) methods such as Random-vector Gradient Estimation (RGE) can successfully replace BPTT to train RNNs, with convergence rates that match or exceed those of BPTT by up to 19-fold, while using orders of magnitude less memory and cost, as the model remains in inference mode throughout training. We further demonstrate that Central-Difference RGE (CD-RGE) corresponds to optimizing a smoothed surrogate loss, inherently regularizing training and improving generalization. Our method matches or outperforms BPTT across three settings: (1) overfitting, (2) transduction, and (3) language modeling. Across all tasks, with sufficient perturbations, our models generalize as well as or better than those trained with BPTT, often in fewer steps. Despite the need for more forward passes per step, we can surpass BPTT wall-clock time per step using recent advancements such as FlashRNN and distributed inference.
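The CD-RGE estimator described above can be sketched in a few lines: estimate the gradient as the average over random directions u of u * (f(theta + sigma*u) - f(theta - sigma*u)) / (2*sigma), using only forward evaluations. This is a generic sketch of central-difference random-vector gradient estimation, not the paper's code; the step size, perturbation count, and parameter layout are assumptions.

```python
import random

def cd_rge_step(loss, params, sigma=1e-3, lr=1e-2, n_perturb=8):
    """One CD-RGE update: average central-difference directional
    derivatives over n_perturb Gaussian directions, then take a
    gradient-descent step. Only forward (inference-mode) evaluations
    of `loss` are used, so no activations need to be stored."""
    grad = [0.0] * len(params)
    for _ in range(n_perturb):
        u = [random.gauss(0.0, 1.0) for _ in params]
        plus = loss([p + sigma * ui for p, ui in zip(params, u)])
        minus = loss([p - sigma * ui for p, ui in zip(params, u)])
        scale = (plus - minus) / (2.0 * sigma * n_perturb)
        for j, uj in enumerate(u):
            grad[j] += scale * uj            # accumulate u-weighted estimate
    return [p - lr * g for p, g in zip(params, grad)]
```

Because only forward passes are needed, memory stays at inference-mode levels regardless of sequence length, which is exactly the property that makes this attractive for long-context RNN training.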


Mixture of Lookup Experts

Jie, Shibo, Tang, Yehui, Han, Kai, Li, Yitong, Tang, Duyu, Deng, Zhi-Hong, Wang, Yunhe

arXiv.org Artificial Intelligence

Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert's computation results based on input ids and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with experts offloading, while maintaining performance on par with MoE.
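The re-parameterization step works because each expert's input is just the embedding of a token id, so its output for every id in the vocabulary can be precomputed once. A minimal sketch of that idea, with hypothetical `build_lut`/`mole_expert_output` helpers (not MoLE's actual API):

```python
def build_lut(expert_fn, embedding, vocab_size):
    """Re-parameterize a trained expert FFN as a lookup table: since
    the expert only ever sees embedding-layer outputs, its result for
    each token id can be precomputed, stored off-GPU, and fetched by
    id at inference time."""
    return [expert_fn(embedding[t]) for t in range(vocab_size)]

def mole_expert_output(lut, token_id):
    # Inference: no expert FLOPs at all, just an indexed fetch.
    return lut[token_id]
```

At inference, only the rows for the current tokens are transferred into VRAM, which is why the communication cost is negligible compared with offloading whole expert weight matrices.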


Boost AMD's Ryzen AI Max performance up to 60% with this memory trick

PCWorld

If you've purchased a laptop or tablet with an AMD Ryzen chip inside, there's a performance tweak you absolutely need to know about. Savvy gamers know instinctively that you can boost your game's frame rate by lowering the resolution or the visual quality, or by making an adjustment to the Windows power-performance slider. But the Ryzen AI Max is a new kind of device: a killer mobile processor that can run modern games at elevated frame rates, and serve as an AI powerhouse. The tweak is a simple adjustment of the Ryzen AI Max's unified frame buffer, or available graphics memory. While it's a simple fix, in my tests, it made an enormous difference: up to a 60 percent performance boost in some cases.


Adobe Firefly muscles into AI video–here's what it looks like

PCWorld

Adobe said today that it's bringing AI-generated video, aka the Firefly Video Model, to Adobe Premiere Pro plus its Firefly generative art service. Unlike its generative AI image capabilities, however, it won't be free. AI-generated video has been available for months. In December, OpenAI released Sora, which can craft AI video clips of several seconds from a text prompt. What Adobe is offering is authenticity.