AITopics | gpu architecture

gpu architecture

ProbSelect: Stochastic Client Selection for GPU-Accelerated Compute Devices in the 3D Continuum

Stanisic, Andrija, Nastic, Stefan

arXiv.org Artificial Intelligence | Nov-12-2025

Integration of edge, cloud, and space devices into a unified 3D continuum imposes significant challenges for client selection in federated learning systems. Traditional approaches rely on continuous monitoring and historical data collection, which become impractical in dynamic environments where satellites and mobile devices frequently change operational conditions. Furthermore, existing solutions primarily consider CPU-based computation, failing to capture the complex characteristics of the GPU-accelerated training that is prevalent across the 3D continuum. This paper introduces ProbSelect, a novel approach utilizing analytical modeling and probabilistic forecasting for client selection on GPU-accelerated devices, without requiring historical data or continuous monitoring. Extensive evaluation across diverse GPU architectures and workloads demonstrates that ProbSelect improves SLO compliance by 13.77% on average while achieving 72.5% computational waste reduction compared to baseline approaches.

artificial intelligence, machine learning, selection, (17 more...)

arXiv.org Artificial Intelligence

2511.08147

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Communications > Networks (1.00)
(3 more...)

QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

Zhou, Qirui, Peng, Shaohui, Xiong, Weiqiang, Chen, Haixin, Wen, Yuanbo, Li, Haochen, Li, Ling, Guo, Qi, Zhao, Yongwei, Gao, Ke, Chen, Ruizhi, Wu, Yanjun, Zhao, Chen, Chen, Yunji

arXiv.org Artificial Intelligence | Jun-17-2025

The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming, hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown considerable promise in code generation tasks, but struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator or utilize low-level primitives to exploit GPU performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic from low-level implementation on GPU, and to enhance LLMs' understanding of the attention operator. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, our method significantly outperforms vanilla LLMs, achieving a speed-up of up to 35.16x. In addition, our method not only surpasses human-optimized libraries (cuDNN and official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts.

attention operator, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2506.12355

Country: Europe > Austria (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation

He, Guoliang, Yoneki, Eiko

arXiv.org Artificial Intelligence | Mar-25-2024

Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive, as they have billions of parameters and are trained with massive amounts of data. Thus, recent works have developed dedicated CUDA kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are as fully utilized as possible. In this work, we explore the possibility of GPU native instruction optimization to further push the CUDA kernels to extreme performance. Contrary to prior works, we adopt an automatic optimization approach by defining a search space of possible GPU native instruction schedules, and then we apply stochastic search to perform optimization. Experiments show that SIP can further improve CUDA kernel throughput by automatically discovering better GPU native instruction schedules, and the optimized schedules are tested on 10 million test samples.
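
The idea of a stochastic search over instruction schedules can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the four-instruction program, the latencies, the in-order cost model, and the plain hill-climbing acceptance rule are hypothetical stand-ins, not SIP's actual search space, cost measurement, or search algorithm.

```python
import random

# Hypothetical toy program: (name, result latency in cycles, dependencies).
INSTRS = [
    ("load_a", 4, []),
    ("use_a", 1, ["load_a"]),
    ("load_b", 4, []),
    ("use_b", 1, ["load_b"]),
]
LATENCY = {name: lat for name, lat, _ in INSTRS}
DEPS = {name: set(deps) for name, _, deps in INSTRS}


def is_valid(schedule):
    """A schedule is valid if every instruction follows its dependencies."""
    pos = {name: i for i, name in enumerate(schedule)}
    return all(pos[d] < pos[n] for n in schedule for d in DEPS[n])


def cost(schedule):
    """Toy in-order model: one issue per cycle, stall until operands are ready."""
    ready = {}  # name -> cycle at which its result becomes available
    cycle = 0
    for name in schedule:
        start = max([cycle] + [ready[d] for d in DEPS[name]])
        ready[name] = start + LATENCY[name]
        cycle = start + 1
    return max(ready.values())


def stochastic_search(schedule, steps=200, seed=0):
    """Randomly swap adjacent instructions; keep valid, non-worse schedules."""
    rng = random.Random(seed)
    best, best_cost = list(schedule), cost(schedule)
    for _ in range(steps):
        cand = list(best)
        i = rng.randrange(len(cand) - 1)
        cand[i], cand[i + 1] = cand[i + 1], cand[i]
        if is_valid(cand) and cost(cand) <= best_cost:
            best, best_cost = cand, cost(cand)
    return best, best_cost


initial = [name for name, _, _ in INSTRS]
best, best_cost = stochastic_search(initial)
```

In this toy model the search tends to hoist the second load ahead of the first consumer, hiding load latency. A real system would measure kernel runtime instead of using an analytical cost, and would check the perturbed kernel for correctness against test inputs, as the abstract indicates.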

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2403.16863

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
Europe > Greece > Attica > Athens (0.06)
North America > United States > New York > New York County > New York City (0.05)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Optimistic Verifiable Training by Controlling Hardware Nondeterminism

Srivastava, Megha, Arora, Simran, Boneh, Dan

arXiv.org Artificial Intelligence | Mar-16-2024

The increasing compute demands of AI systems have led to the emergence of services that train models on behalf of clients lacking necessary resources. However, ensuring correctness of training and guarding against potential training-time attacks, such as data poisoning, poses challenges. Existing works on verifiable training largely fall into two classes: proof-based systems, which struggle to scale because they require cryptographic techniques, and "optimistic" methods that consider a trusted third-party auditor who replicates the training process. A key challenge with the latter is that hardware nondeterminism between GPU types during training prevents an auditor from replicating the training process exactly, and such schemes are therefore non-robust. We propose a method that combines training in a higher precision than the target model, rounding after intermediate computation steps, and storing rounding decisions based on an adaptive thresholding procedure, to successfully control for nondeterminism. Across three different NVIDIA GPUs (A40, Titan XP, RTX 2080 Ti), we achieve exact training replication at FP32 precision for both full-training and fine-tuning of ResNet-50 (23M) and GPT-2 (117M) models. Our verifiable training scheme significantly decreases the storage and time costs compared to proof-based systems.
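
A small sketch can show why computing in higher precision and then rounding helps with order-dependent floating-point results. It is only an illustration of that one ingredient, under assumptions: the helper names and the four-element input are hypothetical, differing accumulation order stands in for hardware nondeterminism, and the stored rounding decisions and adaptive thresholding mentioned in the abstract are not modeled.

```python
import struct


def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate at FP32: round after every addition."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def sum_f64_then_round(values):
    """Accumulate at FP64, then round once to the FP32 target precision."""
    acc = 0.0
    for v in values:
        acc += v
    return to_f32(acc)


# Same numbers, two accumulation orders (a stand-in for hardware differences).
order_a = [1e8, 1.0, -1e8, 1.0]
order_b = [1e8, -1e8, 1.0, 1.0]

f32_a, f32_b = sum_f32(order_a), sum_f32(order_b)
hp_a, hp_b = sum_f64_then_round(order_a), sum_f64_then_round(order_b)
```

For this input the two FP32 sums disagree (1.0 versus 2.0) because 1e8 + 1 is not representable in FP32, while both higher-precision sums round to 2.0. Higher precision alone does not guarantee agreement for every input, which is where the additional mechanisms described in the abstract would come in.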

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2403.09603

Country:

North America > United States > New Mexico (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.82)

Industry: Information Technology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems

Randall, Thomas, Allen, Tyler, Ge, Rong

arXiv.org Artificial Intelligence | Dec-12-2023

Word2Vec remains one of the highly impactful innovations in the field of Natural Language Processing (NLP), representing latent grammatical and syntactical information in human text with dense, low-dimensional vectors. Word2Vec has a high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.

architecture, full-w2v, implementation, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3447818.3460373

2312.07743

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Italy (0.04)

Genre: Research Report (0.50)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Minuet: Accelerating 3D Sparse Convolutions on GPUs

Yang, Jiacheng, Giannoula, Christina, Wu, Jun, Elhoushi, Mostafa, Gleeson, James, Pekhimenko, Gennady

arXiv.org Artificial Intelligence | Dec-1-2023

Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. Unlike dense convolution, SC preserves the sparsity of the input point cloud by only allowing outputs at specific locations. To efficiently compute SC, prior SC engines first use hash tables to build a kernel map that stores the necessary General Matrix Multiplication (GEMM) operations to be executed (Map step), and then use a Gather-GEMM-Scatter process to execute these GEMM operations (GMaS step). In this work, we analyze the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs. Minuet proposes to (i) replace the hash tables used in the Map step with a novel segmented sorting double-traversed binary search algorithm that highly utilizes the on-chip memory hierarchy of GPUs, (ii) use a lightweight scheme to autotune the tile size in the Gather and Scatter operations of the GMaS step, so as to adapt the execution to the particular characteristics of each SC layer, dataset, and GPU architecture, and (iii) employ a padding-efficient GEMM grouping approach that reduces both memory padding and kernel launching overheads. Our evaluations show that Minuet significantly outperforms prior SC engines by 1.74× on average (up to 2.22×) for end-to-end point cloud network executions. Our novel segmented sorting double-traversed binary search algorithm achieves speedups of 15.8× on average (up to 26.8×) over prior SC engines in the Map step. The source code of Minuet is publicly available at https://github.com/UofT-EcoSystem/Minuet.
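
The Map step described above can be sketched on the CPU as follows. This is a simplified illustration of building a kernel map with binary search over sorted coordinates in place of hash lookups; it is not Minuet's segmented sorting double-traversed algorithm, and the function name, the assumption of unique integer coordinates, and the three-point input are hypothetical.

```python
from bisect import bisect_left
from itertools import product


def build_kernel_map(in_coords, out_coords, kernel_size=3):
    """Return {offset: [(input_index, output_index), ...]}.

    Each pair names one input feature that contributes to one output
    through the kernel weight at `offset`. Coordinates are assumed unique.
    """
    # Sort the input coordinates once, remembering the original indices.
    order = sorted(range(len(in_coords)), key=lambda i: in_coords[i])
    sorted_coords = [in_coords[i] for i in order]

    half = kernel_size // 2
    kernel_map = {}
    for off in product(range(-half, half + 1), repeat=3):
        pairs = []
        for out_idx, (x, y, z) in enumerate(out_coords):
            query = (x + off[0], y + off[1], z + off[2])
            # Binary search replaces the hash-table lookup.
            pos = bisect_left(sorted_coords, query)
            if pos < len(sorted_coords) and sorted_coords[pos] == query:
                pairs.append((order[pos], out_idx))
        if pairs:
            kernel_map[off] = pairs
    return kernel_map


# Outputs are placed at the input locations, so sparsity is preserved.
coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
kmap = build_kernel_map(coords, coords)
```

Each entry of `kmap` lists the (input, output) pairs that share one kernel weight, which is the grouping a Gather-GEMM-Scatter stage would then consume. The isolated point at (5, 5, 5) only appears under the center offset.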

minuet, operation, query, (14 more...)

arXiv.org Artificial Intelligence

2401.06145

Country:

North America > Canada > Ontario > Toronto (0.28)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Nvidia launches a new GPU architecture and the Grace CPU Superchip – TechCrunch

#artificialintelligence | Mar-24-2022, 09:00:17 GMT

At its annual GTC conference for AI developers, Nvidia today announced its next-gen Hopper GPU architecture and the Hopper H100 GPU, as well as a new data center chip that combines the GPU with a high-performance CPU, which Nvidia calls the "Grace CPU Superchip" (not to be confused with the Grace Hopper Superchip). With Hopper, Nvidia is launching a number of new and updated technologies, but for AI developers, the most important one may just be the architecture's focus on transformer models, which have become the machine learning technique de rigueur for many use cases and which power models like GPT-3 and BERT. The new Transformer Engine in the H100 chip promises to speed up model training by up to six times, and because this new architecture also features Nvidia's new NVLink Switch system for connecting multiple nodes, large server clusters powered by these chips will be able to scale up to support massive networks with less overhead. "The largest AI models can require months to train on today's computing platforms," Nvidia's Dave Salvator writes in today's announcement. AI, high performance computing and data analytics are growing in complexity with some models, like large language ones, reaching trillions of parameters.

gpu architecture, nvidia, superchip, (9 more...)

#artificialintelligence

Industry: Information Technology > Hardware (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Nvidia takes the wraps off Hopper, its latest GPU architecture

#artificialintelligence | Mar-22-2022, 16:20:21 GMT

After much speculation, Nvidia today at its March 2022 GTC event announced the Hopper GPU architecture, a line of graphics cards that the company says will accelerate the types of algorithms commonly used in data science. Named for Grace Hopper, the pioneering U.S. computer scientist, the new architecture succeeds Nvidia's Ampere architecture, which launched roughly two years ago. The first card in the Hopper lineup is the H100, containing 80 billion transistors and a component called the Transformer Engine that's designed to speed up specific categories of AI models. Another architectural highlight includes Nvidia's MIG technology, which allows an H100 to be partitioned into seven smaller, isolated instances to handle different types of jobs.

architecture, nvidia, precision, (15 more...)

#artificialintelligence

Country: Asia > Japan (0.05)

Industry: Information Technology > Hardware (1.00)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.30)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Nvidia Offers The Ultimate AI Learning Tool With Jetson Nano 2GB

#artificialintelligence | Dec-3-2020, 21:09:36 GMT

Nvidia has asserted its leadership in Artificial Intelligence (AI) with a GPU architecture that continues to evolve with the growing demands of both training and inferencing AI workloads. The latest Ampere architecture provided a huge jump in performance and also allows the GPU to be partitioned to act as seven individual inference engines. As a result of Ampere, Nvidia's own supercomputer Selene, based on the DGX A100 server, ranks fifth in the TOP500 supercomputers and number one in the Green500 supercomputers. However, Nvidia is focused on more than just extreme computing, as demonstrated by its proposed acquisition of Arm. Even without acquiring Arm, Nvidia has been pushing the boundaries of AI down to lower-power and small form factor applications.

architecture, developer kit, gpu architecture, (14 more...)

#artificialintelligence

Industry: Information Technology > Hardware (1.00)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Nvidia's bleeding-edge Ampere GPU architecture revealed: 5 things PC gamers need to know

PCWorld | May-15-2020, 10:28:05 GMT

Nearly a year and a half after the GeForce RTX 20-series launched with Nvidia's Turing architecture inside, and three years after the launch of the data center-focused Volta GPUs, CEO Jensen Huang unveiled graphics cards powered by the new Ampere architecture during a digital GTC 2020 keynote on Thursday morning. It looks like an absolute monster. Ampere debuts in the form of the A100, a humongous data center GPU powering Nvidia's new DGX-A100 systems. Make no mistake: This 6,912 CUDA core-packing beast targets data scientists, with internal hardware optimized around deep learning tasks. You won't be using it to play Cyberpunk 2077.

artificial intelligence, machine learning, nvidia, (15 more...)

PCWorld

Industry:

Information Technology > Hardware (0.95)
Leisure & Entertainment > Games > Computer Games (0.89)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.72)
