AITopics | hbm

Collaborating Authors

hbm

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System

Fang, Yunhua, Xie, Rui, Haq, Asad Ul, Ma, Linsen, Maghraoui, Kaoutar El, Wang, Naigang, Wang, Meng, Liu, Liu, Zhang, Tong

arXiv.org Artificial IntelligenceSep-16-2025

Abstract--Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. T o our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2508.13231

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Information Technology (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Tabular foundation model for GEOAI benchmark problems BM/AirportSoilProperties/2/2025

Saito, Taiga, Otake, Yu, Wu, Stephen

arXiv.org Artificial IntelligenceSep-4-2025

This paper presents a novel application of the Tabular Prior-Data Fitted Network (TabPFN) - a transformer-based foundation model for tabular data - to geotechnical site characterization problems defined in the GEOAI benchmark BM/AirportSoilProperties/2/2025. Two tasks are addressed: (1) predicting the spatial variation of undrained shear strength (su) across borehole depth profiles, and (2) imputing missing mechanical parameters in a dense-site dataset. We apply TabPFN in a zero-training, few-shot, in-context learning setting - without hyper-parameter tuning - and provide it with additional context from the big indirect database (BID). The study demonstrates that TabPFN, as a general-purpose foundation model, achieved superior accuracy and well-calibrated predictive distributions compared to a conventional hierarchical Bayesian model (HBM) baseline, while also offering significant gains in inference efficiency. In Benchmark Problem #1 (spatial su prediction), TabPFN outperformed the HBM in prediction accuracy and delivered an order-of-magnitude faster runtime. In Benchmark Problem #2 (missing mechanical parameter imputation), TabPFN likewise achieved lower RMSE for all target parameters with well-quantified uncertainties, though its cumulative computation cost was higher than HBM's due to its one-variable-at-a-time inference. These results mark the first successful use of a tabular foundation model in geotechnical modeling, suggesting a potential paradigm shift in probabilistic site characterization.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2509.03191

Country: Asia > Japan > Honshū (0.46)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

GeoMaNO: Geometric Mamba Neural Operator for Partial Differential Equations

Han, Xi, Zhang, Jingwei, Samaras, Dimitris, Hou, Fei, Qin, Hong

arXiv.org Artificial IntelligenceMay-20-2025

The neural operator (NO) framework has emerged as a powerful tool for solving partial differential equations (PDEs). Recent NOs are dominated by the Transformer architecture, which offers NOs the capability to capture long-range dependencies in PDE dynamics. However, existing Transformer-based NOs suffer from quadratic complexity, lack geometric rigor, and thus suffer from sub-optimal performance on regular grids. As a remedy, we propose the Geometric Mamba Neural Operator (GeoMaNO) framework, which empowers NOs with Mamba's modeling capability, linear complexity, plus geometric rigor. We evaluate GeoMaNO's performance on multiple standard and popularly employed PDE benchmarks, spanning from Darcy flow problems to Navier-Stokes problems. GeoMaNO improves existing baselines in solution operator approximation by as much as 58.9%.

artificial intelligence, machine learning, operator, (15 more...)

arXiv.org Artificial Intelligence

2505.1202

Country: North America > United States (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Vision (0.93)

Add feedback

Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

Zhang, Hang, Shi, Jiuchen, Wang, Yixiao, Chen, Quan, Shan, Yizhou, Guo, Minyi

arXiv.org Artificial IntelligenceMay-8-2025

Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in high bandwidth memory of accelerations can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance like Time-To-First-Toke (TTFT), neglecting usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system to optimize the serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during the inference with a unified caching pool. The cache swapper determines the swap-in or out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that ELORA reduces the TTFT by 63.4% on average, compared to state-of-the-art works.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.03756

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study

Ahmed, Tamim, Rikakis, Thanassis

arXiv.org Artificial IntelligenceMay-6-2025

Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable. We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations. Our approach employs multi-view data (ipsilateral, contralateral, and top perspectives), applying early and late fusion to combine features across views and models. Hierarchical Bayesian Models (HBMs) infer movement quality components, enhancing interpretability. A clinician dashboard displays task scores, execution times, and quality assessments. We conducted a study with five clinicians who reviewed 500 video ratings generated by our system, providing feedback on its accuracy and usability. Evaluated on a stroke rehabilitation dataset, our framework achieves 89.0% validation accuracy with late fusion, with HBMs aligning closely with manual assessments. This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.

artificial intelligence, assessment, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2505.0168

Country: North America > United States > California (0.17)

Genre: Research Report (0.64)

Industry: Health & Medicine (1.00)

Add feedback

AdaSplash: Adaptive Sparse Flash Attention

Gonçalves, Nuno, Treviso, Marcos, Martins, André F. T.

arXiv.org Artificial IntelligenceFeb-17-2025

The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches -- and in some cases surpasses -- the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.

large language model, machine learning, plash, (21 more...)

arXiv.org Artificial Intelligence

2502.12082

Country:

Europe (1.00)
North America > United States > New York (0.28)

Genre: Research Report (0.40)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Managed-Retention Memory: A New Class of Memory for the AI Era

Legtchenko, Sergey, Stefanovici, Ioan, Black, Richard, Rowstron, Antony, Liu, Junyi, Costa, Paolo, Canakci, Burcu, Narayanan, Dushyanth, Wu, Xingbo

arXiv.org Artificial IntelligenceJan-16-2025

AI clusters today are one of the major uses of High Bandwidth Memory (HBM). However, HBM is suboptimal for AI workloads for several reasons. Analysis shows HBM is overprovisioned on write performance, but underprovisioned on density and read bandwidth, and also has significant energy per bit overheads. It is also expensive, with lower yield than DRAM due to manufacturing complexity. We propose a new memory class: Managed-Retention Memory (MRM), which is more optimized to store key data structures for AI inference workloads. We believe that MRM may finally provide a path to viability for technologies that were originally proposed to support Storage Class Memory (SCM). These technologies traditionally offered long-term persistence (10+ years) but provided poor IO performance and/or endurance. MRM makes different trade-offs, and by understanding the workload IO patterns, MRM foregoes long-term data retention and write performance for better potential performance on the metrics important for these workloads.

managed-retention memory, requirement, workload, (15 more...)

arXiv.org Artificial Intelligence

2501.09605

Country:

North America > United States > New York > New York County > New York City (0.04)
South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(5 more...)

Genre: Research Report (0.40)

Industry:

Information Technology (0.47)
Semiconductors & Electronics (0.46)

Technology:

Information Technology > Hardware > Memory (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

Peng, Jie, Cao, Zhang, Qu, Huaizhi, Zhang, Zhengyu, Guo, Chang, Zhang, Yanyong, Cao, Zhichao, Chen, Tianlong

arXiv.org Artificial IntelligenceOct-22-2024

Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and associated extensive computing make LLMs' deployment the main part of carbon emission from nowadays AI applications. Compared to modern GPUs like H$100$, it would be significantly carbon-sustainable if we could leverage old-fashioned GPUs such as M$40$ (as shown in Figure 1, M$40$ only has one third carbon emission of H$100$'s) for LLM servings. However, the limited High Bandwidth Memory (HBM) available on such GPU often cannot support the loading of LLMs due to the gigantic model size and intermediate activation data, making their serving challenging. For instance, a LLaMA2 model with $70$B parameters typically requires $128$GB for inference, which substantially surpasses $24$GB HBM in a $3090$ GPU and remains infeasible even considering the additional $64$GB DRAM. To address this challenge, this paper proposes a mixed-precision with a model modularization algorithm to enable LLM inference on outdated hardware with resource constraints. (The precision denotes the numerical precision like FP16, INT8, INT4) and multi-level caching (M2Cache).) Specifically, our M2Cache first modulizes neurons in LLM and creates their importance ranking. Then, it adopts a dynamic sparse mixed-precision quantization mechanism in weight space to reduce computational demands and communication overhead at each decoding step. It collectively lowers the operational carbon emissions associated with LLM inference. Moreover, M2Cache introduces a three-level cache management system with HBM, DRAM, and SSDs that complements the dynamic sparse mixed-precision inference. To enhance communication efficiency, M2Cache maintains a neuron-level mixed-precision LRU cache in HBM, a larger layer-aware cache in DRAM, and a full model in SSD.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.1474

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
North America > United States > North Carolina (0.04)
(7 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Energy > Renewable (1.00)
Energy > Coal (0.77)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

Chen, Shimao, Liu, Zirui, Wu, Zhiying, Zheng, Ce, Cong, Peizhuang, Jiang, Zihan, Wu, Yuhan, Su, Lei, Yang, Tong

arXiv.org Artificial IntelligenceSep-26-2024

As the foundation of large language models (LLMs), self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats like INT4, etc. Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data format.

arxiv preprint arxiv, flashattention, int-flashattention, (15 more...)

arXiv.org Artificial Intelligence

2409.16997

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Qin, Zhen, Sun, Weigao, Li, Dong, Shen, Xuyang, Sun, Weixuan, Zhong, Yiran

arXiv.org Artificial IntelligenceJun-20-2024

We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a casual setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs utilizing conventional transformer structures. The source code is released at github.com/OpenNLPLab/TransnormerLLM.

language model, lightning attention, tnl, (14 more...)

arXiv.org Artificial Intelligence

2405.17381

Country:

North America > United States (0.14)
Europe > Austria > Vienna (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback