AITopics | performance counter

A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions

Mejbah Alam, Justin Gottschlich, Nesime Tatbul, Javier S. Turek, Tim Mattson, Abdullah Muzahid

Neural Information Processing Systems, Feb-13-2026, 05:01:26 GMT

Neural Information Processing Systems http://nips.cc/

autoencoder, international conference, proceedings, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > District of Columbia > Washington (0.05)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > Texas > Brazos County > College Station (0.04)
(4 more...)

Genre:

Research Report > Promising Solution (0.68)
Research Report > New Finding (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)


A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions

Mejbah Alam, Justin Gottschlich, Nesime Tatbul, Javier S. Turek, Tim Mattson, Abdullah Muzahid

Neural Information Processing Systems, Oct-3-2025, 07:47:47 GMT

The field of machine programming (MP), the automation of the development of software, is making notable research advances.

artificial intelligence, machine learning, proceedings, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > District of Columbia > Washington (0.05)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > Texas > Brazos County > College Station (0.04)
(4 more...)

Genre:

Research Report > Promising Solution (0.68)
Research Report > New Finding (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)


Omniwise: Predicting GPU Kernels Performance with LLMs

Wang, Zixian, Ramos, Cole, Awad, Muhammad A., Lowery, Keith

arXiv.org Artificial Intelligence, Jun-27-2025

In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks beyond human reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline that applies large language models (LLMs) to GPU kernel performance prediction--a novel use case in performance profiling. Omniwise is model-agnostic and lightweight, achieving strong results even with a small 3B-parameter model. It can predict key performance metrics, including memory bandwidth, cache hit rates, GFLOPs, and arithmetic intensity, directly from kernel code without the need for code execution or profiling tools. Our approach achieves over 90% of predictions within 10% relative error on GPU kernels executed on AMD MI250 and MI300X architectures. In addition to the pipeline, we develop an online inference server and a Visual Studio Code plugin that seamlessly integrate LLM-based performance prediction into developers' workflows.
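The accuracy figure quoted above ("over 90% of predictions within 10% relative error") can be read as a simple hit rate. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name and the sample bandwidth values are hypothetical.

```python
def fraction_within_tolerance(predicted, measured, tolerance=0.10):
    """Share of predictions whose relative error is at most `tolerance`."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    hits = 0
    for p, m in zip(predicted, measured):
        if m == 0:
            raise ValueError("relative error is undefined for a zero measurement")
        # Relative error is taken with respect to the measured value.
        if abs(p - m) / abs(m) <= tolerance:
            hits += 1
    return hits / len(measured)


# Hypothetical memory-bandwidth values in GB/s, not taken from the paper.
measured_gbps = [100.0, 100.0, 100.0, 100.0]
predicted_gbps = [100.0, 95.0, 120.0, 50.0]
print(fraction_within_tolerance(predicted_gbps, measured_gbps))  # 2 of 4 within 10%
```

Under this reading, the reported result would correspond to a return value above 0.9 over the evaluated kernels and metrics.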

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.20886

Country:

North America > United States > Illinois > Champaign County > Urbana (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Runtime Detection of Adversarial Attacks in AI Accelerators Using Performance Counters

Rahaman, Habibur, Chatterjee, Atri, Bhunia, Swarup

arXiv.org Artificial Intelligence, Mar-10-2025

Rapid adoption of AI technologies raises several major security concerns, including the risks of adversarial perturbations, which threaten the confidentiality and integrity of AI applications. Protecting AI hardware from misuse and diverse security threats is a challenging task. To address this challenge, we propose SAMURAI, a novel framework for safeguarding against malicious usage of AI hardware and its resilience to attacks. SAMURAI introduces an AI Performance Counter (APC) for tracking dynamic behavior of an AI model coupled with an on-chip Machine Learning (ML) analysis engine, known as TANTO (Trained Anomaly Inspection Through Trace Observation). APC records the runtime profile of the low-level hardware events of different AI operations. Subsequently, the summary information recorded by the APC is processed by TANTO to efficiently identify potential security breaches and ensure secure, responsible use of AI. SAMURAI enables real-time detection of security threats and misuse without relying on traditional software-based solutions that require model integration. Experimental results demonstrate that SAMURAI achieves up to 97% accuracy in detecting adversarial attacks with moderate overhead on various AI models, significantly outperforming conventional software-based approaches. It enhances security and regulatory compliance, providing a comprehensive solution for safeguarding AI against emergent threats.

apc metric, operation, tanto, (12 more...)

arXiv.org Artificial Intelligence

2503.07568

Country: North America > United States > California > Los Angeles County > Santa Monica (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)


μRL: Discovering Transient Execution Vulnerabilities Using Reinforcement Learning

Tol, M. Caner, Derya, Kemal, Sunar, Berk

arXiv.org Artificial Intelligence, Feb-20-2025

We propose using reinforcement learning to address the challenges of discovering microarchitectural vulnerabilities, such as Spectre and Meltdown, which exploit subtle interactions in modern processors. Traditional methods like random fuzzing fail to efficiently explore the vast instruction space and often miss vulnerabilities that manifest under specific conditions. To overcome this, we introduce an intelligent, feedback-driven approach using RL. Our RL agents interact with the processor, learning from real-time feedback to prioritize instruction sequences more likely to reveal vulnerabilities, significantly improving the efficiency of the discovery process. We also demonstrate that RL systems adapt effectively to various microarchitectures, providing a scalable solution across processor generations. By automating the exploration process, we reduce the need for human intervention, enabling continuous learning that uncovers hidden vulnerabilities. Additionally, our approach detects subtle signals, such as timing anomalies or unusual cache behavior, that may indicate microarchitectural weaknesses. This proposal advances hardware security testing by introducing a more efficient, adaptive, and systematic framework for protecting modern processors. When unleashed on Intel Skylake-X and Raptor Lake microarchitectures, our RL agent was indeed able to generate instruction sequences that cause significant observable byte leakages through transient execution without generating any μcode assists, faults, or interrupts. The newly identified leaky sequences stem from a variety of Intel instructions, including SERIALIZE, VERR/VERW, CLMUL, MMX-x87 transitions, LSL+RDSCP and LAR. These initial results give credence to the proposed approach.

instruction, instruction sequence, vulnerability, (15 more...)

arXiv.org Artificial Intelligence

2502.14307

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)


Reviews: A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions

Neural Information Processing Systems, Feb-11-2025, 22:47:46 GMT

This paper describes a system for detecting the source of performance regressions in source code. The idea is to measure hardware performance counters (HPCs) at a per-function level of the code; when a performance regression is detected, it is localized by looking for the function with the most anomalous performance counters. The anomaly detection is done by training autoencoders on the HPCs, and there is a further idea to cluster functions with similar behavior profiles to avoid the need for learning an autoencoder for every function in a large code base. This is a controversial paper because there is little methodological novelty. R1 gave the lowest score and asks whether we want to allow this kind of paper in NeurIPS, worrying that if we accept any application of ML, then NeurIPS risks becoming too broad.
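The localization step described in this review can be sketched as follows. This is a deliberately simplified stand-in, not the paper's method: instead of an autoencoder's reconstruction error, each function's anomaly score is a z-score against a baseline fitted only on regression-free runs (the "zero-positive" setting). The function names and counter values are hypothetical.

```python
import statistics


def fit_baseline(normal_runs):
    """Fit per-counter mean and spread from regression-free runs only."""
    columns = list(zip(*normal_runs))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 when a counter never varies, to avoid dividing by zero.
    spreads = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, spreads


def anomaly_score(baseline, sample):
    """Largest per-counter z-score; stands in for reconstruction error."""
    means, spreads = baseline
    return max(abs(x - m) / s for x, m, s in zip(sample, means, spreads))


def localize_regression(normal_profiles, new_profiles):
    """Return the function whose new counters deviate most, plus all scores."""
    scores = {
        name: anomaly_score(fit_baseline(runs), new_profiles[name])
        for name, runs in normal_profiles.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical per-function counter vectors: [cache misses, branch misses].
normal_profiles = {
    "parse": [[100, 10], [102, 11], [98, 9]],
    "render": [[500, 50], [510, 52], [490, 48]],
}
new_profiles = {"parse": [101, 10], "render": [900, 51]}

suspect, scores = localize_regression(normal_profiles, new_profiles)
print(suspect)  # the function with the most anomalous counters
```

Replacing `fit_baseline` and `anomaly_score` with a trained autoencoder and its reconstruction error would bring the sketch closer to the approach the review summarizes; the ranking step would stay the same.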

autoencoder, diagnosing software performance regression, zero-positive learning approach, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.98)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.65)


Performance Optimization using Multimodal Modeling and Heterogeneous GNN

Dutta, Akash, Alcaraz, Jordi, TehraniJamsaz, Ali, Cesar, Eduardo, Sikora, Anna, Jannesari, Ali

arXiv.org Artificial Intelligence, Apr-27-2023

Growing heterogeneity and configurability in HPC architectures have made auto-tuning applications and runtime parameters on these systems very complex. Users are presented with a multitude of options to configure parameters. In addition to application-specific solutions, a common approach is to use general-purpose search strategies, which often fail to identify the best configurations or take prohibitively long to converge. There is, thus, a need for a general-purpose and efficient tuning approach that can be easily scaled and adapted to various tuning tasks. We propose a technique for tuning parallel code regions that is general enough to be adapted to multiple tasks. In this paper, we analyze IR-based programming models to make task-specific performance optimizations. To this end, we propose the Multimodal Graph Neural Network and Autoencoder (MGA) tuner, a multimodal deep learning based approach that adapts Heterogeneous Graph Neural Networks and Denoising Autoencoders for modeling IR-based code representations that serve as separate modalities. This approach is used as part of our pipeline to model a syntax, semantics, and structure-aware IR-based code representation for tuning parallel code regions/kernels. We extensively experiment on OpenMP and OpenCL code regions/kernels obtained from PolyBench, Rodinia, STREAM, DataRaceBench, AMD SDK, NPB, NVIDIA SDK, Parboil, SHOC, and LULESH benchmarks. We apply our multimodal learning techniques to the tasks of (i) optimizing the number of threads, scheduling policy, and chunk size in OpenMP loops and (ii) identifying the best device for heterogeneous device mapping of OpenCL kernels. Our experiments show that this multimodal learning based approach outperforms the state-of-the-art in all experiments.

artificial intelligence, experiment, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2304.12568

Country:

North America > United States > Virginia > Albemarle County > Charlottesville (0.14)
North America > United States > Oregon > Lane County > Eugene (0.14)
Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
(6 more...)

Genre: Research Report (1.00)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Power Constrained Autotuning using Graph Neural Networks

Dutta, Akash, Choi, Jee, Jannesari, Ali

arXiv.org Artificial Intelligence, Feb-22-2023

Recent advances in multi- and many-core processors have led to significant improvements in the performance of scientific computing applications. However, the addition of a large number of complex cores has also increased the overall power consumption, and power has become a first-order design constraint in modern processors. While we can limit power consumption by simply applying software-based power constraints, applying them blindly will lead to non-trivial performance degradation. To address the challenge of improving the performance, power, and energy efficiency of scientific applications on modern multi-core processors, we propose a novel Graph Neural Network based auto-tuning approach that (i) optimizes runtime performance at pre-defined power constraints, and (ii) simultaneously optimizes for runtime performance and energy efficiency by minimizing the energy-delay product. The key idea behind this approach lies in modeling parallel code regions as flow-aware code graphs to capture both semantic and structural code features. We demonstrate the efficacy of our approach by conducting an extensive evaluation on 30 benchmarks and proxy-/mini-applications with 68 OpenMP code regions. Our approach identifies OpenMP configurations at different power constraints that yield a geometric mean performance improvement of more than 25% and 13% over the default OpenMP configuration on a 32-core Skylake and a 16-core Haswell processor, respectively. In addition, when we optimize for the energy-delay product, the OpenMP configurations selected by our auto-tuner demonstrate both performance improvement of 21% and 11% and energy reduction of 29% and 18% over the default OpenMP configuration at Thermal Design Power for the same Skylake and Haswell processors, respectively.
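For readers unfamiliar with the two quantities named in this abstract, the sketch below shows how an energy-delay product (energy multiplied by runtime) can rank candidate configurations, and how a geometric mean summarizes per-benchmark speedups. It illustrates the metrics only and says nothing about the paper's graph neural network; the `Measurement` type and all numbers are hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    threads: int       # hypothetical OpenMP thread count
    runtime_s: float   # wall-clock time in seconds
    energy_j: float    # energy consumed in joules

    @property
    def edp(self):
        """Energy-delay product: lower is better on both axes at once."""
        return self.energy_j * self.runtime_s


def best_by_edp(measurements):
    """Pick the configuration with the smallest energy-delay product."""
    return min(measurements, key=lambda m: m.edp)


# Hypothetical measurements of one code region, not taken from the paper.
measurements = [
    Measurement(threads=8, runtime_s=10.0, energy_j=1000.0),   # EDP 10000
    Measurement(threads=16, runtime_s=6.0, energy_j=900.0),    # EDP 5400
    Measurement(threads=32, runtime_s=5.0, energy_j=1200.0),   # EDP 6000
]
print(best_by_edp(measurements).threads)

# Geometric mean of hypothetical per-benchmark speedups over a default setup.
speedups = [2.0, 8.0]
print(statistics.geometric_mean(speedups))
```

Note that the fastest configuration (32 threads) is not the one with the lowest energy-delay product here, which is why the two objectives are reported separately.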

artificial intelligence, configuration, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2302.11467

Country:

North America > United States > Iowa (0.05)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States > Oregon (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry:

Energy (0.47)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Dynamic GPU Energy Optimization for Machine Learning Training Workloads

Wang, Farui, Zhang, Weizhe, Lai, Shichao, Hao, Meng, Wang, Zheng

arXiv.org Artificial Intelligence, Jan-5-2022

GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly large, they require a longer time to train, leading to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance counter profiling overhead, it uses an analytical model to detect the training iteration change and only collects performance counter data when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search algorithm to find a trade-off between execution time and energy consumption. We evaluate GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.
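The idea of a local search that trades execution time against energy can be illustrated with a small hill-climbing sketch. It is an assumption-heavy toy, not GPOEO's algorithm: the predicted ratios, the equal-weight cost function, and the candidate clock frequencies are all hypothetical, and a lookup table stands in for the gradient-boosting models mentioned in the abstract.

```python
# Hypothetical model output per GPU clock (MHz): (time ratio, energy ratio),
# both relative to the default configuration.
PREDICTED = {
    900: (1.30, 0.95),
    1200: (1.12, 0.82),
    1500: (1.05, 0.84),
    1800: (1.00, 1.00),
}


def weighted_cost(freq_mhz, time_weight=0.5):
    """Assumed objective: weighted sum of predicted time and energy ratios."""
    time_ratio, energy_ratio = PREDICTED[freq_mhz]
    return time_weight * time_ratio + (1.0 - time_weight) * energy_ratio


def local_search(candidates, cost, start_index):
    """Move to the cheaper adjacent candidate until no neighbor improves."""
    current = start_index
    while True:
        neighbors = [i for i in (current - 1, current + 1)
                     if 0 <= i < len(candidates)]
        if not neighbors:
            return candidates[current]
        best = min(neighbors, key=lambda i: cost(candidates[i]))
        # Stop at a local optimum: no neighbor is strictly cheaper.
        if cost(candidates[best]) >= cost(candidates[current]):
            return candidates[current]
        current = best


frequencies = sorted(PREDICTED)
chosen = local_search(frequencies, weighted_cost, start_index=len(frequencies) - 1)
print(chosen)
```

A search like this only evaluates neighbors of the current setting, which keeps the number of model queries small but can stop at a local optimum when the cost surface is not unimodal.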

application, clock frequency, frequency, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TPDS.2021.3137867

2201.01684

Country:

Asia > China > Heilongjiang Province > Harbin (0.04)
Europe > United Kingdom > England > West Yorkshire > Leeds (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.84)

Industry: Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
(2 more...)


MAPLE: Microprocessor A Priori for Latency Estimation

Abbasi, Saad, Wong, Alexander, Shafiee, Mohammad Javad

arXiv.org Artificial Intelligence, Nov-29-2021

Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption. As such, neural architecture search (NAS) algorithms take these two constraints into account when generating a new architecture. However, efficiency metrics such as latency are typically hardware dependent, requiring the NAS algorithm to either measure or predict the architecture latency. Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process. Here we propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation but instead generalizes to new hardware by incorporating prior hardware characteristics during training. MAPLE takes advantage of a novel quantitative strategy to characterize the underlying microprocessor by measuring relevant hardware performance metrics, yielding a fine-grained and expressive hardware descriptor. Moreover, MAPLE benefits from the tightly coupled I/O between the CPU and GPU and their dependency to predict DNN latency on GPUs while measuring microprocessor hardware performance counters from the CPU feeding the GPU hardware. With this quantitative strategy as the hardware descriptor, MAPLE can generalize to new hardware via a few-shot adaptation strategy: with as few as 3 samples it exhibits a 3% improvement over state-of-the-art methods requiring as many as 10 samples. Experimental results showed that increasing the few-shot adaptation samples to 10 improves the accuracy over the state-of-the-art methods by 12%. Furthermore, MAPLE exhibits 8-10% better accuracy, on average, compared to relevant baselines at any number of adaptation samples.

architecture, hardware, latency, (13 more...)

arXiv.org Artificial Intelligence

2111.15106

Country: North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)

Genre:

Research Report > Promising Solution (0.86)
Research Report > New Finding (0.68)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
