Microarchitecture
A Joint Learning Approach to Hardware Caching and Prefetching
Yuan, Samuel, Saxena, Divyanshu, Chen, Jiayi, Sharma, Nihal, Akella, Aditya
Several learned policies have been proposed to replace heuristics for scheduling, caching, and other system components in modern systems. By leveraging diverse features, learning from historical trends, and predicting future behaviors, such models promise to keep pace with ever-increasing workload dynamism and continuous hardware evolution. However, policies trained in isolation may still achieve suboptimal performance when placed together. In this paper, we inspect one such instance in the domain of hardware caching: the policies of cache replacement and prefetching. We argue that these two policies are bidirectionally interdependent and make the case for training the two jointly. We propose a joint learning approach based on developing shared representations for the features used by the two policies. We present two approaches to develop these shared representations, one based on a joint encoder and another based on contrastive learning of the embeddings, and demonstrate promising preliminary results for both. Finally, we lay down an agenda for future research in this direction.
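To make the contrastive variant concrete, here is a minimal sketch of aligning the two policies' feature encoders with an InfoNCE-style objective, assuming each policy extracts its own feature vector for every memory access; the feature dimensions and network sizes are illustrative, not the paper's.

```python
# Sketch: align cache-replacement and prefetcher feature encoders with a
# contrastive (InfoNCE-style) objective so both policies share an embedding
# space. Feature dimensions and architectures are illustrative assumptions.
import torch
import torch.nn.functional as F

repl_enc = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 32))
pref_enc = torch.nn.Sequential(torch.nn.Linear(24, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 32))

def info_nce(za, zb, temperature=0.1):
    za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
    logits = za @ zb.T / temperature      # pairwise similarities
    labels = torch.arange(za.size(0))     # row i matches column i
    return F.cross_entropy(logits, labels)

# One training step on a batch of accesses observed by both policies.
repl_feats = torch.randn(256, 16)   # e.g., recency/frequency features
pref_feats = torch.randn(256, 24)   # e.g., delta-history features
opt = torch.optim.Adam(list(repl_enc.parameters()) +
                       list(pref_enc.parameters()), lr=1e-3)
loss = info_nce(repl_enc(repl_feats), pref_enc(pref_feats))
loss.backward(); opt.step()
```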
Bang for the Buck: Vector Search on Cloud CPUs
Vector databases have emerged as a new class of systems that support efficient querying of high-dimensional vectors. Many of them offer their database as a service in the cloud. However, the variety of available CPUs and the lack of vector search benchmarks across CPUs make it difficult for users to choose one. In this study, we show that CPU microarchitectures available in the cloud perform significantly differently across vector search scenarios. For instance, on an IVF index over float32 vectors, AMD's Zen4 gives almost 3x more queries per second (QPS) than Intel's Sapphire Rapids, but for HNSW indexes, the tables turn. However, when looking at queries per dollar (QP$), Graviton3 is the best option for most indexes and quantization settings, even over Graviton4 (Table 1). With this work, we hope to guide users in getting the best "bang for the buck" when deploying vector search systems.
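The QP$ metric follows directly from sustained throughput and instance pricing; a small helper makes the arithmetic explicit (instance names and prices below are placeholders, not measurements from the study):

```python
# Queries per dollar (QP$) from sustained QPS and on-demand hourly price.
# The instances and numbers below are hypothetical, for illustration only.
def queries_per_dollar(qps: float, price_per_hour: float) -> float:
    return qps * 3600.0 / price_per_hour  # queries served by one instance-hour

candidates = {
    # instance: (measured_qps, usd_per_hour) -- placeholder values
    "graviton3": (900.0, 0.58),
    "zen4":      (1100.0, 0.86),
}
for name, (qps, price) in candidates.items():
    print(f"{name}: {queries_per_dollar(qps, price):,.0f} queries/$")
```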
Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion
Nasr-Esfahany, Arash, Alizadeh, Mohammad, Lee, Victor, Alam, Hanna, Coon, Brett W., Culler, David, Dadu, Vidushi, Dixon, Martin, Levy, Henry M., Pandey, Santosh, Ranganathan, Parthasarathy, Yazdanbakhsh, Amir
Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program's performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible, e.g., in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.
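The compositional idea can be sketched as follows: cheap analytical models bound each component's contribution to CPI, and those bounds become features for a learned fusion model. The component list and bound formulas here are simplified assumptions, not Concorde's actual models.

```python
# Sketch of the compositional analytical-ML fusion idea: simple analytical
# models bound the CPI impact of each microarchitectural component, and the
# bounds feed a learned model predicting end-to-end CPI. All formulas and
# the component list are illustrative assumptions.
import numpy as np

def analytic_bounds(trace, params):
    """Per-component CPI lower bounds from first-order models."""
    issue_bound = 1.0 / params["issue_width"]
    mem_bound = trace["miss_rate"] * params["mem_latency"] / params["mlp"]
    branch_bound = trace["mispred_rate"] * params["flush_penalty"]
    return np.array([issue_bound, mem_bound, branch_bound])

# Fit a linear fusion model on (bound features -> simulated CPI) pairs.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(500, 3))                        # stand-in features
y = X @ np.array([1.0, 0.8, 0.6]) + rng.normal(0, 0.05, 500)    # stand-in CPI
w, *_ = np.linalg.lstsq(X, y, rcond=None)

trace = {"miss_rate": 0.02, "mispred_rate": 0.01}
cfg = {"issue_width": 4, "mem_latency": 200, "mlp": 4, "flush_penalty": 15}
print("predicted CPI:", analytic_bounds(trace, cfg) @ w)
```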
Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs
Wu, Qizhe, Liang, Huawen, Gui, Yuchen, Zeng, Zhichen, He, Zerong, Tao, Linfeng, Wang, Xiaotian, Zhao, Letian, Zeng, Zhaoxi, Yuan, Wei, Wu, Wei, Jin, Xi
General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or operand reuse strategies. However, considering the interaction between matrix multiplication and multiply-accumulators (MACs) offers greater optimization potential. This work introduces a novel hardware perspective on matrix multiplication, focusing on the bit-weight dimension of MACs. We propose a finer-grained TPE notation using matrix triple loops as an example, introducing new methods for designing and optimizing processing-element (PE) microarchitectures. Based on this notation and its transformations, we propose four optimization techniques that improve timing, area, and power consumption. Implementing our design in RTL using the SMIC-28nm process, we evaluate its effectiveness across four classic TPE architectures: systolic array, 3D-Cube, multiplier-adder tree, and 2D-Matrix. Our techniques achieve area efficiency improvements of 1.27x, 1.28x, 1.56x, and 1.44x, and energy efficiency gains of 1.04x, 1.56x, 1.49x, and 1.20x, respectively. Applied to a bit-slice architecture, our approach achieves a 12.10x improvement in energy efficiency and 2.85x in area efficiency compared to Laconic. Our Verilog HDL code, along with timing, area, and power reports, is available at https://github.com/wqzustc/High-Performance-Tensor-Processing-Engines
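For readers unfamiliar with the notation, the starting point is the standard GEMM triple loop; the paper's bit-weight dimension adds a fourth loop inside each MAC, over the bit positions of the partial products. A sketch with integer operands:

```python
# The standard GEMM triple loop that TPE notations start from. The
# bit-weight dimension would add a fourth loop inside the MAC, over the bit
# positions of the partial products (sketched in the comment below).
def gemm(A, B, M, N, K):
    C = [[0] * N for _ in range(M)]
    for i in range(M):            # output row
        for j in range(N):        # output column
            for k in range(K):    # reduction dimension
                # One full MAC. At the bit-weight level this expands to:
                #   for b in set bits of A[i][k]: C[i][j] += B[k][j] << b
                # which is the loop the paper's transformations reorganize.
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(gemm(A, B, 2, 2, 2))  # [[19, 22], [43, 50]]
```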
Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
Elbtity, Mohammed, Chandarana, Peyton, Zand, Ramtin
Tensor processing units (TPUs) are among the most well-known machine learning (ML) accelerators, utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators such as graphics processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations at the heart of the matrix-matrix and matrix-vector multiplies that dominate the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow: input, output, or weight stationary. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, this work develops a reconfigurable-dataflow TPU, called Flex-TPU, which can dynamically change the dataflow per layer at run time. Our experiments thoroughly test the viability of Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that the Flex-TPU design achieves a significant performance increase of up to 2.75x over a conventional TPU, with only minor area and power overheads.
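A minimal sketch of per-layer dataflow selection conveys the idea; the reuse-based heuristic below is an illustrative stand-in, not the selection policy of the actual design:

```python
# Sketch of run-time per-layer dataflow selection, the idea behind a
# reconfigurable-dataflow TPU. The heuristic (keep stationary the operand
# with the most reuse) is an illustrative assumption, not Flex-TPU's policy.
def pick_dataflow(M, K, N):
    """M, K, N: GEMM dims for one layer (activations MxK, weights KxN)."""
    reuse = {
        "weight_stationary": M,   # each weight is reused across M rows
        "input_stationary":  N,   # each input is reused across N columns
        "output_stationary": K,   # each output accumulates K products
    }
    return max(reuse, key=reuse.get)

layers = [(1, 1024, 1024),    # single-token inference: M = 1
          (256, 1024, 1024),  # large batch
          (64, 4096, 16)]     # wide reduction, narrow output
for dims in layers:
    print(dims, "->", pick_dataflow(*dims))
```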
Tao: Re-Thinking DL-based Microarchitecture Simulation
Pandey, Santosh, Yazdanbakhsh, Amir, Liu, Hang
Microarchitecture simulators are indispensable tools for microarchitecture designers to validate, estimate, and optimize new hardware that meets specific design requirements. While the quest for a fast, accurate, and detailed microarchitecture simulation has been ongoing for decades, existing simulators excel and fall short in different respects: (i) Although execution-driven simulation is accurate and detailed, it is extremely slow and requires expert-level experience to design. (ii) Trace-driven simulation reuses execution traces in pursuit of fast simulation but faces accuracy concerns and fails to achieve significant speedup. (iii) Emerging deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy but fail to provide the low-level microarchitectural performance metrics crucial for bottleneck analysis. Additionally, they introduce substantial overheads from trace regeneration and model re-training when simulating a new microarchitecture. Re-thinking the advantages and limitations of these simulation paradigms, this paper introduces TAO, which redesigns DL-based simulation with three primary contributions. First, we propose a new training dataset design such that the subsequent simulation only needs a functional trace as input, which can be rapidly generated and reused across microarchitectures. Second, we redesign the input features and the DL model using self-attention to support predicting various performance metrics. Third, we propose techniques to train a microarchitecture-agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and reduces the re-training overhead of conventional DL-based simulators. Our extensive evaluation shows TAO can reduce the overall training and simulation time by 18.06x over state-of-the-art DL-based approaches.
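The transfer-learning split can be sketched as a shared, frozen embedding plus a small per-configuration head; the layer sizes and training setup below are assumptions for illustration, not TAO's architecture:

```python
# Sketch of the transfer-learning split described above: a microarchitecture-
# agnostic embedding of the functional trace is trained once, and only a
# small head is re-trained per target configuration. All sizes are assumed.
import torch

embed = torch.nn.Sequential(torch.nn.Embedding(4096, 128),  # opcode vocabulary
                            torch.nn.Linear(128, 128), torch.nn.ReLU())
head = torch.nn.Linear(128, 4)  # e.g., CPI plus three low-level metrics

# After pre-training, freeze the shared embedding and fine-tune only the
# head for a new microarchitecture, avoiding full re-training.
for p in embed.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

trace = torch.randint(0, 4096, (32, 64))   # batch of functional traces
target = torch.randn(32, 4)                # metrics from a few simulations
pred = head(embed(trace).mean(dim=1))      # pool over the instruction axis
loss = torch.nn.functional.mse_loss(pred, target)
loss.backward(); opt.step()
```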
COMET: X86 Cost Model Explanation Framework
Chaudhary, Isha, Renda, Alex, Mendis, Charith, Singh, Gagandeep
ML-based program cost models have been shown to yield fairly accurate program cost predictions. They can replace heavily engineered analytical cost models in mainstream compilers, but their black-box nature discourages adoption. In this work, we propose COMET, the first framework for generating faithful, generalizable, and intuitive explanations for x86 cost models. COMET brings interpretability specifically to ML-based cost models, such as Ithemal. We generate and compare COMET's explanations for Ithemal against its explanations for uiCA, a hand-crafted, accurate analytical model. Our empirical findings show an inverse correlation between a cost model's prediction error and the prominence of semantically richer features in COMET's explanations for that model on a given x86 basic block.
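The general mechanism behind such explanations can be sketched as perturbation-based attribution over a black-box cost model; this is a generic illustration, not COMET's actual algorithm:

```python
# Generic perturbation-based attribution for a black-box basic-block cost
# model -- a simplified stand-in for cost-model explanations, not COMET's
# algorithm. cost_model is assumed to map a list of instruction strings to
# a predicted cycle count.
def attribute(cost_model, block):
    base = cost_model(block)
    deltas = {}
    for i, inst in enumerate(block):
        perturbed = block[:i] + block[i + 1:]  # drop one instruction
        deltas[inst] = base - cost_model(perturbed)
    return dict(sorted(deltas.items(), key=lambda kv: -kv[1]))

# Toy cost model: memory ops cost 5 cycles, everything else costs 1.
toy_cost = lambda blk: sum(5 if "mov" in i and "[" in i else 1 for i in blk)
block = ["mov rax, [rbx]", "add rax, rcx", "mov [rdx], rax"]
print(attribute(toy_cost, block))  # the memory movs dominate the prediction
```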
SemiconX: AI Acceleration Hardware
Artificial Intelligence (AI) is a powerful tool that will be ubiquitous in the upcoming decade, in applications spanning defense, automobiles, robotics, healthcare, the metaverse, and Industry 4.0. Increasing AI model capacities demand scaled, high-throughput compute at iso-energy consumption, which requires a fundamental rethinking of power savings in compute and dataflow. The basic difference between custom AI hardware and a general-purpose design is that deep learning computation and dataflow are structured and the network is known prior to execution; the underlying architecture can therefore be optimized specifically for the AI execution datapath, and the hardware overhead for the control path can be minimized. Because of the enormous potential in this space, it has gained considerable traction from investors, with several AI hardware startups raising $4B combined at a total valuation of around $10B. Computing architectures can also be classified by target application: datacenter-scale AI (with both training and high-precision inference workloads) or edge computing, which deploys lightweight models at low-to-intermediate resolution.
NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators
Manglik, Aditya, Patel, Minesh, Mao, Haiyu, Salami, Behzad, Park, Jisung, Orosa, Lois, Mutlu, Onur
Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide natively. Consequently, state-of-the-art works either integrate additional digital logic circuits to support the non-MAC operations or offload the non-MAC operations to CPU/GPU, resulting in significant performance and energy efficiency overheads due to data movement. In this work, we propose NEON, a novel compiler optimization to enable the end-to-end execution of the NN workload in RRAM. The key idea of NEON is to transform each non-MAC operation into a lightweight yet highly-accurate neural network. Utilizing neural networks to approximate the non-MAC operations provides two advantages: 1) We can exploit the key strength of RRAM, i.e., highly-parallel MAC operation, to flexibly and efficiently execute non-MAC operations in memory. 2) We can simplify RRAM's microarchitecture by eliminating the additional digital logic circuits while reducing the data movement overheads. Acceleration of the non-MAC operations in memory enables NEON to achieve a 2.28x speedup compared to an idealized digital logic-based RRAM. We analyze the trade-offs associated with the transformation and demonstrate feasible use cases for NEON across different substrates.
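The core transformation can be sketched in a few lines: train a tiny MLP to mimic a non-MAC operation (here, a fixed-width softmax) so that inference uses only MACs. Width, layer sizes, and training setup are illustrative assumptions, not the paper's recipe:

```python
# Sketch of the key idea: replace a non-MAC op (a fixed-width softmax) with
# a tiny MLP, so it executes as MAC operations in RRAM. Width, layer sizes,
# and training setup are illustrative assumptions only.
import torch

W = 8  # fixed softmax width for this sketch
approx = torch.nn.Sequential(torch.nn.Linear(W, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, W), torch.nn.Softplus())
opt = torch.optim.Adam(approx.parameters(), lr=1e-3)

for step in range(2000):                    # brief offline training
    x = torch.randn(512, W) * 3             # random logits
    loss = torch.nn.functional.mse_loss(approx(x), torch.softmax(x, dim=1))
    opt.zero_grad(); loss.backward(); opt.step()

x = torch.randn(1, W)
print(torch.softmax(x, dim=1))
print(approx(x))   # close approximation, and MAC-only at inference time
```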
GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation
Sykora, Ondrej, Phothilimthana, Phitchaya Mangpo, Mendis, Charith, Yazdanbakhsh, Amir
Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed by a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block, allowing more precise throughput estimation. Our results establish a new state of the art for basic block performance estimation, with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduces the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feed-forward decoder networks. Our results show that this technique further improves the precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, and distill a set of methods for achieving high accuracy in basic block performance estimation.
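The graph construction that such models start from can be sketched by matching each register read to its most recent writer; the parsing below is deliberately naive (a real pipeline would use a disassembler):

```python
# Sketch of a basic-block dependency graph in the style GRANITE consumes:
# nodes are instructions, edges are data dependencies found by matching
# each register read to its most recent writer. The (dst, srcs) encoding
# is a simplification; real systems parse operands with a disassembler.
def dependency_graph(block):
    edges, last_writer = [], {}
    for i, (dst, srcs) in enumerate(block):
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], i))  # writer -> reader
        if dst:
            last_writer[dst] = i
    return edges

# (destination, sources) per instruction of a small x86-like block
block = [("rax", ["rbx"]),          # 0: mov rax, rbx
         ("rcx", ["rax", "rdx"]),   # 1: lea rcx, [rax + rdx]
         (None,  ["rcx", "rax"])]   # 2: cmp rcx, rax
print(dependency_graph(block))      # [(0, 1), (1, 2), (0, 2)]
```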