memory hierarchy
The Role of Advanced Computer Architectures in Accelerating Artificial Intelligence Workloads
Amin, Shahid, Shah, Syed Pervez Hussnain
The remarkable progress in Artificial Intelligence (AI) is foundationally linked to a concurrent revolution in computer architecture. As AI models, particularly Deep Neural Networks (DNNs), have grown in complexity, their massive computational demands have pushed traditional architectures to their limits. This paper provides a structured review of this co-evolution, analyzing the architectural landscape designed to accelerate modern AI workloads. We explore the dominant architectural paradigms, Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Arrays (FPGAs), by breaking down their design philosophies, key features, and performance trade-offs. The core principles essential for performance and energy efficiency, including dataflow optimization, advanced memory hierarchies, sparsity, and quantization, are analyzed. Furthermore, this paper looks ahead to emerging technologies such as Processing-in-Memory (PIM) and neuromorphic computing, which may redefine future computation. By synthesizing architectural principles with quantitative performance data from industry-standard benchmarks, this survey presents a comprehensive picture of the AI accelerator landscape. We conclude that AI and computer architecture are in a symbiotic relationship, where hardware-software co-design is no longer an optimization but a necessity for future progress in computing.
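As a worked illustration of two of the principles surveyed here, the following minimal Python sketch shows post-training symmetric int8 quantization and magnitude-based pruning. The function names and the per-tensor scaling scheme are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of two surveyed principles: symmetric int8 quantization
# and magnitude-based sparsity. Uses numpy only; names are illustrative.
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a single per-tensor scale."""
    scale = max(float(np.max(np.abs(w))) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def prune_magnitude(w, sparsity):
    """Zero out the smallest-magnitude weights to reach a target sparsity."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
w_sparse = prune_magnitude(w, sparsity=0.9)
print("max dequantization error:", np.abs(w - q.astype(np.float32) * s).max())
print("achieved sparsity:", np.mean(w_sparse == 0))
```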
From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR
Wang, Erwei, Bayliss, Samuel, Bisca, Andra, Blair, Zachary, Chowdhary, Sangeeta, Denolf, Kristof, Fifield, Jeff, Freiberger, Brandon, Hunhoff, Erika, James-Roxby, Phil, Lo, Jack, Melber, Joseph, Neuendorffer, Stephen, Richter, Eddie, Rosti, Andre, Setoain, Javier, Singh, Gagandeep, Taka, Endri, Vasireddy, Pranathi, Yu, Zhewen, Zhang, Niansong, Zhuang, Jinming
General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.
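The scheduling idea, overlapping data movement with compute via tiling and double buffering, can be sketched outside of MLIR. The Python below is a hedged analogue, with dma_load standing in for an asynchronous data-movement primitive; the name and the synchronous simulation are assumptions, not MLIR-AIR's API.

```python
# Hedged sketch (not MLIR-AIR code) of tiled matmul with double buffering:
# issue the "DMA" for the next tile before consuming the current one.
import numpy as np

T = 32  # tile size

def dma_load(src, row_blk, col_blk):
    """Stand-in for an asynchronous data-movement primitive."""
    return src[row_blk*T:(row_blk+1)*T, col_blk*T:(col_blk+1)*T].copy()

def matmul_tiled(A, B):
    n = A.shape[0] // T
    C = np.zeros_like(A)
    for i in range(n):
        for j in range(n):
            acc = np.zeros((T, T), dtype=A.dtype)
            a_buf = dma_load(A, i, 0)            # prologue: fetch first tiles
            b_buf = dma_load(B, 0, j)
            for k in range(n):
                a_cur, b_cur = a_buf, b_buf
                if k + 1 < n:                    # "overlap": issue next loads
                    a_buf = dma_load(A, i, k + 1)
                    b_buf = dma_load(B, k + 1, j)
                acc += a_cur @ b_cur             # compute on current tiles
            C[i*T:(i+1)*T, j*T:(j+1)*T] = acc
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(matmul_tiled(A, B), A @ B, atol=1e-2)
```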
Computing-In-Memory Dataflow for Minimal Buffer Traffic
Song, Choongseok, Jeong, Doo Seok
Computing-In-Memory (CIM) offers a potential solution to the memory wall and can achieve high energy efficiency by minimizing data movement, making it a promising architecture for edge AI devices. Lightweight models like MobileNet and EfficientNet, which use depthwise convolution for feature extraction, have been developed for these devices. However, CIM macros often struggle to accelerate depthwise convolution, suffering from underutilization of CIM memory and heavy buffer traffic. The latter, in particular, has been overlooked despite its significant impact on latency and energy consumption. To address this, we introduce a novel CIM dataflow that significantly reduces buffer traffic by maximizing data reuse and improving memory utilization during depthwise convolution. The proposed dataflow is grounded in solid theoretical principles, fully demonstrated in this paper. When applied to MobileNet and EfficientNet models, our dataflow reduces buffer traffic by 77.4-87.0%.
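A back-of-the-envelope Python sketch of why row reuse cuts buffer traffic in depthwise convolution follows. The 1-D line-buffer model and the sizes are illustrative assumptions, not the paper's dataflow.

```python
# Hedged sketch: a K-row line buffer lets each input row be fetched once
# instead of K times. Counts row fetches for a 1-D analogue of depthwise
# convolution with kernel height K over an input of height H.
K, H = 3, 32

naive_fetches = (H - K + 1) * K   # every output row re-reads K input rows
reuse_fetches = K + (H - K)       # prologue, then one new row per output row

print(f"naive: {naive_fetches} row fetches, with reuse: {reuse_fetches}")
print(f"traffic reduction: {1 - reuse_fetches / naive_fetches:.1%}")
```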
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
Dong, Shouyang, Wen, Yuanbo, Bi, Jun, Huang, Di, Guo, Jiaming, Xu, Jianxing, Xu, Ruibai, Song, Xinkai, Hao, Yifan, Zhou, Xuehai, Chen, Tianshi, Guo, Qi, Chen, Yunji
Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques demand either tremendous manual effort or suffer from functional incorrectness, leaving "Write Once, Run Anywhere" an open question for tensor programs. We propose a novel transcompiler, QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets at a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and the sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with 95% accuracy on average, and the translated programs achieve up to 2.0x the performance of vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.
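The neural-symbolic loop can be sketched as follows: an LLM pass proposes a rewrite, a checker validates it against tests, and a bounded symbolic search repairs failures. In the hedged Python sketch below, llm_transform is a stub that injects a plausible slip, and the operator-swap repair is a toy stand-in for real program synthesis; none of the names reflect QiMeng-Xpiler's interface.

```python
# Hedged sketch of the neural-symbolic idea: propose, check, repair.
def llm_transform(src: str) -> str:
    """Stand-in for an LLM-assisted pass; here it makes a deliberate slip."""
    return src.replace("a + b", "a - b")  # plausible but wrong rewrite

def check(prog: str, tests) -> bool:
    """Validate a candidate program against input/output test cases."""
    env = {}
    exec(prog, env)
    return all(env["kernel"](a, b) == out for a, b, out in tests)

def repair(prog: str, tests) -> str:
    """Bounded symbolic search: try swapping operators until tests pass."""
    for op in ("+", "-", "*"):
        cand = prog.replace(" - ", f" {op} ").replace(" + ", f" {op} ")
        if check(cand, tests):
            return cand
    raise ValueError("no repair found within the search budget")

src = "def kernel(a, b):\n    return a + b\n"
tests = [(1, 2, 3), (5, 7, 12)]
out = llm_transform(src)
if not check(out, tests):          # checker catches the LLM's slip
    out = repair(out, tests)       # symbolic search restores correctness
assert check(out, tests)
print(out)
```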
Review for NeurIPS paper: HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory
The paper attempts to scale nearest neighbor search using heterogeneous memory hardware. To this end, the authors devised a practical trick on top of HNSW: a clean node promotion strategy along the memory hierarchy that uses degree information. The method was evaluated on some common large datasets, though not necessarily difficult ones. Reviewers found the setup for leveraging the memory hierarchy interesting, and the benefits obtained from it appear promising.
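The promotion strategy can be sketched in a few lines: rank nodes by degree and pin the best-connected ones in fast memory. The Python below is a hedged illustration with a synthetic degree table; the structures and capacities are assumptions, not HM-ANN's implementation.

```python
# Hedged sketch: promote the highest-degree graph nodes to fast memory,
# since they are visited most often during search.
import random

random.seed(0)
n, fast_capacity = 1000, 100
degree = {v: random.randint(1, 64) for v in range(n)}  # toy degree table

promoted = set(sorted(degree, key=degree.get, reverse=True)[:fast_capacity])

def fetch(v):
    """Return which memory tier serves node v."""
    return "fast" if v in promoted else "slow"

hot = max(degree, key=degree.get)
print(hot, "->", fetch(hot))
# Share of edge traversals served from fast memory (toy estimate):
print(sum(degree[v] for v in promoted) / sum(degree.values()))
```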
TinyM$^2$Net-V3: Memory-Aware Compressed Multimodal Deep Neural Networks for Sustainable Edge Deployment
Rashid, Hasib-Al, Mohsenin, Tinoosh
The advancement of sophisticated artificial intelligence (AI) algorithms has led to a notable increase in energy usage and carbon dioxide emissions, intensifying concerns about climate change. This growing problem has brought the environmental sustainability of AI technologies to the forefront, especially as they expand across various sectors. In response to these challenges, there is an urgent need for the development of sustainable AI solutions. These solutions must focus on energy-efficient embedded systems that are capable of handling diverse data types even in environments with limited resources, thereby ensuring both technological progress and environmental responsibility. Integrating complementary multimodal data into tiny machine learning models for edge devices is challenging due to increased complexity, latency, and power consumption. This work introduces TinyM$^2$Net-V3, a system that processes different modalities of complementary data, designs deep neural network (DNN) models, and employs model compression techniques, including knowledge distillation and low bit-width quantization with memory-aware considerations, to fit models within lower memory hierarchy levels, reducing latency and enhancing energy efficiency on resource-constrained devices. We evaluated TinyM$^2$Net-V3 in two multimodal case studies: COVID-19 detection using cough, speech, and breathing audio, and pose classification from depth and thermal images. With tiny inference models (6 KB and 58 KB), we achieved accuracies of 92.95% and 90.7%, respectively. Our tiny machine learning models, deployed on resource-limited hardware, demonstrated latencies within milliseconds and very high power efficiency.
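The memory-aware step can be illustrated as a budget check: pick the widest bit-width whose compressed model still fits the target memory level. The Python sketch below assumes a hypothetical parameter count and level capacities; only the 6 KB figure echoes the abstract, and the fallback policy is an assumption.

```python
# Hedged sketch of memory-aware bit-width selection.
def model_bytes(n_params: int, bits: int) -> int:
    """Storage for n_params weights at a given bit-width, rounded up."""
    return (n_params * bits + 7) // 8

def pick_bitwidth(n_params: int, budget_bytes: int, widths=(8, 4, 2)):
    """Prefer higher precision; fall back until the model fits."""
    for bits in widths:
        if model_bytes(n_params, bits) <= budget_bytes:
            return bits
    raise ValueError("model cannot fit even at the lowest width")

n_params = 12_000  # hypothetical tiny model
for level, cap in [("on-chip SRAM", 6 * 1024), ("next level", 64 * 1024)]:
    bits = pick_bitwidth(n_params, cap)
    print(f"{level}: {bits}-bit -> {model_bytes(n_params, bits)} bytes")
```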
A Configurable and Efficient Memory Hierarchy for Neural Network Hardware Accelerator
Bause, Oliver, Bernardo, Paul Palomero, Bringmann, Oliver
As machine learning applications continue to evolve, the demand for efficient hardware accelerators tailored specifically for deep neural networks (DNNs) becomes increasingly vital. In this paper, we propose a configurable memory hierarchy framework tailored to the per-layer adaptive memory access patterns of DNNs. The hierarchy requests data on demand from off-chip memory to provide it to the accelerator's compute units. The objective is to strike an optimized balance between minimizing the required memory capacity and maintaining high accelerator performance. The framework is characterized by its configurability, allowing the creation of a tailored memory hierarchy with up to five levels. Furthermore, the framework incorporates an optional shift register as the final level to increase the flexibility of the memory management process. A comprehensive loop-nest analysis of DNN layers shows that the framework can efficiently execute the access patterns of most loop unrolls. Synthesis results and a case study of the DNN accelerator UltraTrail indicate a possible reduction in chip area of up to 62.2%, as smaller memory modules can be used. At the same time, the performance loss can be kept to 2.4%.
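The on-demand behavior of such a hierarchy can be sketched as a chain of buffers, where each level forwards misses to the next. The Python below is a hedged model with a FIFO policy and toy capacities, both assumptions rather than details of the proposed framework.

```python
# Hedged sketch of an on-demand, multi-level buffer chain: each level holds
# a window of addresses and forwards misses to its backing level; the last
# level models off-chip memory.
from collections import OrderedDict

class Level:
    def __init__(self, capacity, backing=None):
        self.buf = OrderedDict()   # FIFO buffer of address -> data
        self.capacity, self.backing = capacity, backing
        self.misses = 0

    def read(self, addr):
        if addr not in self.buf:
            self.misses += 1
            data = self.backing.read(addr) if self.backing else addr
            if len(self.buf) >= self.capacity:
                self.buf.popitem(last=False)   # evict oldest entry
            self.buf[addr] = data
        return self.buf[addr]

offchip = Level(capacity=10**9)
l2 = Level(capacity=256, backing=offchip)
l1 = Level(capacity=32, backing=l2)

# A tiled loop nest re-touches a small working set, so most accesses hit L1.
for _ in range(4):
    for addr in range(32):
        l1.read(addr)
print("L1 misses:", l1.misses, "L2 misses:", l2.misses)
```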
Cost-effective On-device Continual Learning over Memory Hierarchy with Miro
Ma, Xinyue, Jeong, Suyeon, Zhang, Minjia, Wang, Di, Choi, Jonghyun, Jeon, Myeongjae
Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy while not compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework by enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms baseline systems we build for comparison, consistently achieving higher cost-effectiveness.
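The knob-picking loop can be sketched as follows: profile a few candidate configurations and keep the one with the best accuracy per joule. In the hedged Python sketch below, profile() returns synthetic numbers standing in for on-device measurements, and the replay-buffer size is an assumed example of a tunable parameter, not Miro's actual interface.

```python
# Hedged sketch of online profiling for cost-effectiveness: pick the
# configuration maximizing accuracy per unit energy.
def profile(cfg):
    """Stand-in for a short on-device profiling run (synthetic numbers)."""
    acc = 0.70 + 0.04 * cfg["replay_mb"] ** 0.5   # diminishing returns
    energy = 1.0 + 0.30 * cfg["replay_mb"]        # grows with buffer size
    return acc, energy

candidates = [{"replay_mb": m} for m in (1, 4, 16, 64)]
best = max(candidates, key=lambda c: profile(c)[0] / profile(c)[1])
acc, energy = profile(best)
print(f"chosen: {best}, accuracy={acc:.3f}, energy={energy:.2f} J")
```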
PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR
Ma, Zixuan, Wang, Haojie, Xing, Jingze, Zheng, Liyan, Zhang, Chen, Cao, Huanqi, Huang, Kezhao, Tang, Shizhi, Wang, Penghan, Zhai, Jidong
Deep neural networks (DNNs) are of critical use across domains. To accelerate DNN computation, tensor compilers have been proposed to generate efficient code for different domain-specific accelerators. Existing tensor compilers mainly focus on optimizing computation efficiency. However, memory access is becoming a key performance bottleneck because the computational performance of accelerators is increasing much faster than memory performance. The lack of a direct description of memory access and data dependence in current tensor compilers' intermediate representations (IRs) makes it significantly harder to generate memory-efficient code. In this paper, we propose IntelliGen, a tensor compiler that can generate high-performance code for memory-intensive operators by considering both computation and data-movement optimizations. IntelliGen represents a DNN program using GIR, which includes primitives indicating its computation, data movement, and parallel strategies. This information is further composed into an instruction-level dataflow graph to perform holistic optimizations, searching different memory access patterns and computation operations and generating memory-efficient code for different hardware. We evaluate IntelliGen on NVIDIA GPU, AMD GPU, and Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively, compared to the current most performant frameworks.
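An instruction-level dataflow graph in the spirit of GIR can be sketched as nodes for load, compute, and store primitives connected by dependence edges, with a scheduler walking them in topological order. The Python below is a hedged illustration; the node kinds and fields are assumptions, not PowerFusion's IR.

```python
# Hedged sketch of an instruction-level dataflow graph: compute and
# data-movement primitives with explicit dependences, emitted in
# dependence order by a simple scheduler (input assumed to be a DAG).
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                      # "load", "compute", or "store"
    deps: list = field(default_factory=list)

def schedule(nodes):
    """Emit each node once all of its dependences have been emitted."""
    done, order = set(), []
    while len(order) < len(nodes):
        for n in nodes:
            if n.name not in done and all(d in done for d in n.deps):
                order.append(n)
                done.add(n.name)
    return order

gir = [
    Node("ld_a", "load"),
    Node("ld_b", "load"),
    Node("add", "compute", deps=["ld_a", "ld_b"]),
    Node("st_c", "store", deps=["add"]),
]
print([f"{n.kind}:{n.name}" for n in schedule(gir)])
```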
KAPLA: Pragmatic Representation and Fast Solving of Scalable NN Accelerator Dataflow
Dataflow scheduling decisions are of vital importance to neural network (NN) accelerators. Recent scalable NN accelerators support a rich set of advanced dataflow techniques. The problems of comprehensively representing and quickly finding optimized dataflow schemes thus become significantly more complicated and challenging. In this work, we first propose comprehensive and pragmatic dataflow representations for temporal and spatial scheduling on scalable multi-node NN architectures. An informal hierarchical taxonomy highlights the tight coupling across different levels of the dataflow space as the major difficulty for fast design exploration. A set of formal tensor-centric directives accurately expresses various inter-layer and intra-layer schemes and allows their validity and efficiency to be determined quickly. We then build a generic, optimized, and fast dataflow solver, KAPLA, which uses these pragmatic directives to explore the design space with effective validity checks and efficiency estimation. KAPLA decouples the upper inter-layer level for fast pruning and solves the lower intra-layer schemes with a novel bottom-up cost-descending method. The resulting dataflows incur energy overheads of only 2.2% and 7.7% for training and inference, respectively, compared to exhaustively searched optimal schemes. KAPLA also outperforms random and machine-learning-based approaches, producing more optimized results with orders-of-magnitude faster search.
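The bottom-up cost-descending idea can be sketched as follows: fix a lower-level tiling factor first, compute a lower bound on cost, and expand only partial schemes that can still beat the best complete one. The Python below uses a toy cost model; the traffic and utilization terms are assumptions, not KAPLA's estimator.

```python
# Hedged sketch of bottom-up, bound-pruned search over tiling factors.
N = 256  # problem size along one loop dimension

def cost(t1, t2):
    """Toy cost: buffer-traffic terms plus an on-chip capacity penalty."""
    return N / t1 + N / t2 + 0.01 * t1 * t2

factors = [t for t in (2, 4, 8, 16, 32, 64) if N % t == 0]
best, best_cost = None, float("inf")
for t1 in factors:                 # fix the lower (intra-layer) level first
    bound = N / t1                 # cost can never drop below this term
    if bound >= best_cost:
        continue                   # prune: cannot beat the current best
    for t2 in factors:
        c = cost(t1, t2)
        if c < best_cost:
            best, best_cost = (t1, t2), c
print("best tiling:", best, "cost:", round(best_cost, 2))
```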