tile size
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (10 more...)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (10 more...)
Guiding Application Users via Estimation of Computational Resources for Massively Parallel Chemistry Computations
Tabassum, Tanzila, Subasi, Omer, Panyala, Ajay, Ebiapia, Epiya, Baumgartner, Gerald, Mutlu, Erdal, P., null, Sadayappan, null, Kowalski, Karol
In this work, we develop machine learning (ML) based strategies to predict resources (costs) required for massively parallel chemistry computations, such as coupled-cluster methods, to guide application users before they commit to running expensive experiments on a supercomputer. By predicting application execution time, we determine the optimal runtime parameter values such as number of nodes and tile sizes. Two key questions of interest to users are addressed. The first is the shortest-time question, where the user is interested in knowing the parameter configurations (number of nodes and tile sizes) to achieve the shortest execution time for a given problem size and a target supercomputer. The second is the cheapest-run question in which the user is interested in minimizing resource usage, i.e., finding the number of nodes and tile size that minimizes the number of node-hours for a given problem size. We evaluate a rich family of ML models and strategies, developed based on the collections of runtime parameter values for the CCSD (Coupled Cluster with Singles and Doubles) application executed on the Department of Energy (DOE) Frontier and Aurora supercomputers. Our experiments show that when predicting the total execution time of a CCSD iteration, a Gradient Boosting (GB) ML model achieves a Mean Absolute Percentage Error (MAPE) of 0.023 and 0.073 for Aurora and Frontier, respectively. In the case where it is expensive to run experiments just to collect data points, we show that active learning can achieve a MAPE of about 0.2 with just around 450 experiments collected from Aurora and Frontier.
- North America > United States > Utah (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- (2 more...)
- Energy (1.00)
- Government > Regional Government (0.67)
ENA: Efficient N-dimensional Attention
TL;DR: A layer-interleaved hybrid architecture of linear recurrence and attention (full or local) matches Transformer performance on high-order data with greater efficiency. Efficient modeling of long sequences of high-order data requires a more efficient architecture than Transformer. In this paper, we investigate two key aspects of extending linear recurrent models, especially those originally designed for language modeling, to high-order data (1D to ND): scanning strategies and attention-hybrid architectures. Empirical results suggest that scanning provides limited benefits, while attention-hybrid models yield promising results. Focusing on the latter, we further evaluate types of attention and find that tiled high-order sliding window attention (SW A) is efficient in both theory and practice. We then conduct several experiments to demonstrate its effectiveness. The intuition behind ENA is that linear recurrence compresses global information into a state, while SW A complements it by enforcing strict local modeling. Together, they form a simple framework that offers a promising and practical solution for ultra-long high-order data modeling. Note that although we ultimately perform no sequence permutation, the framework remains compatible with any scanning. Softmax attention in LLMs has quadratic time complexity, making it inefficient for long sequences. Representative variants include RetNet Sun et al. (2023), HGRN Qin et al. (2024), GLA Y ang et al. (2024a), GSA Zhang et al. (2024), Mamba Gu & Dao (2023) and RWKV Peng et al. (2023). Subsequent advancements such as DeltaNet Y ang et al. (2024b), Gated DeltaNet Y ang et al. (2025), DeltaProduct Siems et al. (2025), LaCT Zhang et al. (2025b) and MesaNet von Oswald et al. (2025) further enhance expressiveness while preserving linear-time complexity and are optimized for parallel training. In this paper, "linear recurrent models", "linear recurrence" and "linear models" are used interchangeably to denote models that perform sequence modeling via state updates with linear time complexity. The same applies to "softmax attention" and "full attention". For convenience, we refer to all data with more than one dimension as higher-order data, including images, which have only two dimensions.
Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels
Song, Mingcong, Tang, Xinru, Hou, Fengfan, Li, Jing, Wei, Wei, Ma, Yipeng, Xiao, Runqiu, Si, Hongjie, Jiang, Dingcheng, Yin, Shouyi, Hu, Yang, Long, Guoping
Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architectural-aware tile sizes. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes. XY-Serve sits harmoniously with vLLM. Experimental results show up to 89% end-to-end throughput improvement compared with current publicly available baselines on Ascend NPUs. Additionally, our approach outperforms existing GEMM (average 14.6% faster) and attention (average 21.5% faster) kernels relative to existing libraries. While the work is Ascend native, we believe the approach can be readily applicable to SIMT architectures as well.
- North America > United States (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs
Kabir, Ehsan, Downey, Austin R. J., Bakos, Jason D., Andrews, David, Huang, Miaoqing
Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2$\times$ and 2.87$\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU respectively. Additionally, it achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.
- North America > United States > Arkansas > Washington County > Fayetteville (0.14)
- North America > United States > South Carolina > Richland County > Columbia (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- (13 more...)
- Semiconductors & Electronics (0.88)
- Information Technology (0.68)
- Media > Television (0.35)
Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach
Xiaoteng, null, Liu, null, Halim, Pavly
Analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix multiplication kernel for fundamental analysis, and NVIDIA's CUTLASS library for comprehensive performance data collection across advanced configurations. Using the NVIDIA RTX 4070 as our experimental platform, we developed a Random Forest-based prediction model with multi-output regression capability. Through analysis of both naive tiled matrix multiplication with varying tile sizes (1 to 32) and 16,128 CUTLASS GEMM operations across diverse configurations, we identified critical performance patterns related to matrix dimensions, thread block configurations, and memory access patterns. Our framework achieved exceptional accuracy with an R^2 score of 0.98 for runtime prediction (mean error 15.57%) and 0.78 for power prediction (median error 5.42%). The system successfully predicts performance across matrix sizes, demonstrating robust scaling behavior. Our results show that optimal tile size selection can improve performance by up to 3.2x while reducing power consumption by 22% compared to baseline configurations. Analysis of shared memory utilization and SM occupancy reveals that tile sizes of 16x16 achieve the best balance between parallelism and resource usage. The implementation of our framework, including prediction models and analysis tools, is available as an open-source project at GPPerf [https://github.com/pavlyhalim/GPPerf].
- Information Technology > Hardware (0.82)
- Energy (0.71)
Medical X-Ray Image Enhancement Using Global Contrast-Limited Adaptive Histogram Equalization
Nia, Sohrab Namazi, Shih, Frank Y.
In medical imaging, accurate diagnosis heavily relies on effective image enhancement techniques, particularly for X-ray images. Existing methods often suffer from various challenges such as sacrificing global image characteristics over local image characteristics or vice versa. In this paper, we present a novel approach, called G-CLAHE (Global-Contrast Limited Adaptive Histogram Equalization), which perfectly suits medical imaging with a focus on X-rays. This method adapts from Global Histogram Equalization (GHE) and Contrast Limited Adaptive Histogram Equalization (CLAHE) to take both advantages and avoid weakness to preserve local and global characteristics. Experimental results show that it can significantly improve current state-of-the-art algorithms to effectively address their limitations and enhance the contrast and quality of X-ray images for diagnostic accuracy.
- North America > United States > New Jersey > Essex County > Newark (0.14)
- Asia > Taiwan (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- (11 more...)
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.34)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.66)
FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs
Kabir, Ehsan, Kabir, Md. Arafat, Downey, Austin R. J., Bakos, Jason D., Andrews, David, Huang, Miaoqing
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are limited hardware accelerators tailored for this mechanism, which is the first step before designing an accelerator for a complete model. This paper proposes \textit{FAMOUS}, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing Ultrascale+ FPGAs. Experimental results are presented that show that it can attain a maximum throughput, number of parallel attention heads, embedding dimension and tile size of 328 (giga operations/second (GOPS)), 8, 768 and 64 respectively on the U55C. Furthermore, it is 3.28$\times$ and 2.6$\times$ faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU respectively. It is also 1.3$\times$ faster than the fastest state-of-the-art FPGA-based accelerator.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > South Carolina (0.04)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- (8 more...)
- Information Technology (0.68)
- Media (0.55)
- Semiconductors & Electronics (0.49)
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Gao, Zhaolin, Zhan, Wenhao, Chang, Jonathan D., Swamy, Gokul, Brantley, Kianté, Lee, Jason D., Sun, Wen
Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York (0.05)
- North America > United States > Rhode Island (0.04)
- (6 more...)
- Workflow (0.93)
- Research Report (0.84)