gpu architecture
ProbSelect: Stochastic Client Selection for GPU-Accelerated Compute Devices in the 3D Continuum
Stanisic, Andrija, Nastic, Stefan
Abstract--Integration of edge, cloud and space devices into a unified 3D continuum imposes significant challenges for client selection in federated learning systems. Traditional approaches rely on continuous monitoring and historical data collection, which becomes impractical in dynamic environments where satellites and mobile devices frequently change operational conditions. Furthermore, existing solutions primarily consider CPU-based computation, failing to capture complex characteristics of GPU-accelerated training that is prevalent across the 3D continuum. This paper introduces ProbSelect, a novel approach utilizing analytical modeling and probabilistic forecasting for client selection on GPU-accelerated devices, without requiring historical data or continuous monitoring. Extensive evaluation across diverse GPU architectures and workloads demonstrates that ProbSelect improves SLO compliance by 13.77% on average while achieving 72.5% computational waste reduction compared to baseline approaches.
- North America > United States > Virginia (0.04)
- Europe (0.04)
- Information Technology > Hardware (1.00)
- Information Technology > Graphics (1.00)
- Information Technology > Communications > Networks (1.00)
- (3 more...)
The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
Large-language models (LLMs) are rapidly being applied to radiology, enabling automated image interpretation and report generation tasks. Their deployment in clinical practice requires both high diagnostic accuracy and low inference latency, which in turn demands powerful hardware. High-performance graphical processing units (GPUs) provide the necessary compute and memory throughput to run large LLMs on imaging data. We review modern GPU architectures (e.g. NVIDIA A100/H100, AMD Instinct MI250X/MI300) and key performance metrics of floating-point throughput, memory bandwidth, VRAM capacity. We show how these hardware capabilities affect radiology tasks: for example, generating reports or detecting findings on CheXpert and MIMIC-CXR images is computationally intensive and benefits from GPU parallelism and tensor-core acceleration. Empirical studies indicate that using appropriate GPU resources can reduce inference time and improve throughput. We discuss practical challenges including privacy, deployment, cost, power and optimization strategies: mixed-precision, quantization, compression, and multi-GPU scaling. Finally, we anticipate that next-generation features (8-bit tensor cores, enhanced interconnect) will further enable on-premise and federated radiology AI. Advancing GPU infrastructure is essential for safe, efficient LLM-based radiology diagnostics.
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm
Zhou, Qirui, Peng, Shaohui, Xiong, Weiqiang, Chen, Haixin, Wen, Yuanbo, Li, Haochen, Li, Ling, Guo, Qi, Zhao, Yongwei, Gao, Ke, Chen, Ruizhi, Wu, Yanjun, Zhao, Chen, Chen, Yunji
The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it must require time-consuming and hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown a lot of promise in code generation tasks, but struggle to generate high-performance attention code. The key challenge is it cannot comprehend the complex data flow and computation process of the attention operator and utilize low-level primitive to exploit GPU performance. To address the above challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic and low-level implementation on GPU, and enhance LLMs' understanding of attention operator. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementation on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, the performance of our methods significantly outshines that of vanilla LLMs, achieving a speed-up of up to 35.16x. Besides, our method not only surpasses human-optimized libraries (cuDNN and official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts.
SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation
Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive as they have billions of parameters and are trained with massive amounts of data. Thus, recent works have developed dedicated CUDA kernels for LLM training and inference instead of relying on compilergenerated ones, so that hardware resources are as fully utilized as possible. In this work, we explore the possibility of GPU native instruction optimization to further push the CUDA kernels to extreme performance. Contrary to prior works, we adopt an automatic optimization approach by defining a search space of possible GPU native instruction schedules, and then we apply stochastic search to perform optimization. Experiments show that SIP can further improve CUDA kernel throughput by automatically discovering better GPU native instruction schedules and the optimized schedules are tested by 10 million test samples.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
- Europe > Greece > Attica > Athens (0.06)
- North America > United States > New York > New York County > New York City (0.05)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Optimistic Verifiable Training by Controlling Hardware Nondeterminism
Srivastava, Megha, Arora, Simran, Boneh, Dan
The increasing compute demands of AI systems has led to the emergence of services that train models on behalf of clients lacking necessary resources. However, ensuring correctness of training and guarding against potential training-time attacks, such as data poisoning, poses challenges. Existing works on verifiable training largely fall into two classes: proof-based systems, which struggle to scale due to requiring cryptographic techniques, and "optimistic" methods that consider a trusted third-party auditor who replicates the training process. A key challenge with the latter is that hardware nondeterminism between GPU types during training prevents an auditor from replicating the training process exactly, and such schemes are therefore non-robust. We propose a method that combines training in a higher precision than the target model, rounding after intermediate computation steps, and storing rounding decisions based on an adaptive thresholding procedure, to successfully control for nondeterminism. Across three different NVIDIA GPUs (A40, Titan XP, RTX 2080 Ti), we achieve exact training replication at FP32 precision for both full-training and fine-tuning of ResNet-50 (23M) and GPT-2 (117M) models. Our verifiable training scheme significantly decreases the storage and time costs compared to proof-based systems.
- North America > United States > New Mexico (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems
Randall, Thomas, Allen, Tyler, Ge, Rong
Word2Vec remains one of the highly-impactful innovations in the field of Natural Language Processing (NLP) that represents latent grammatical and syntactical information in human text with dense vectors in a low dimension. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89\%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy (0.04)
Minuet: Accelerating 3D Sparse Convolutions on GPUs
Yang, Jiacheng, Giannoula, Christina, Wu, Jun, Elhoushi, Mostafa, Gleeson, James, Pekhimenko, Gennady
Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. Different from dense convolution, SC preserves the sparsity of the input point cloud by only allowing outputs to specific locations. To efficiently compute SC, prior SC engines first use hash tables to build a kernel map that stores the necessary General Matrix Multiplication (GEMM) operations to be executed (Map step), and then use a Gather-GEMM-Scatter process to execute these GEMM operations (GMaS step). In this work, we analyze the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs. Minuet proposes to (i) replace the hash tables used in the Map step with a novel segmented sorting double-traversed binary search algorithm that highly utilizes the on-chip memory hierarchy of GPUs, (ii) use a lightweight scheme to autotune the tile size in the Gather and Scatter operations of the GMaS step, such that to adapt the execution to the particular characteristics of each SC layer, dataset, and GPU architecture, and (iii) employ a padding-efficient GEMM grouping approach that reduces both memory padding and kernel launching overheads. Our evaluations show that Minuet significantly outperforms prior SC engines by on average $1.74\times$ (up to $2.22\times$) for end-to-end point cloud network executions. Our novel segmented sorting double-traversed binary search algorithm achieves superior speedups by $15.8\times$ on average (up to $26.8\times$) over prior SC engines in the Map step. The source code of Minuet is publicly available at https://github.com/UofT-EcoSystem/Minuet.
- North America > Canada > Ontario > Toronto (0.28)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
Nvidia launches a new GPU architecture and the Grace CPU Superchip – TechCrunch
At its annual GTC conference for AI developers, Nvidia today announced its next-gen Hopper GPU architecture and the Hopper H100 GPU, as well as a new data center chip that combines the GPU with a high-performance CPU, which Nvidia calls the "Grace CPU Superchip" (not to be confused with the Grace Hopper Superchip). With Hopper, Nvidia is launching a number of new and updated technologies, but for AI developers, the most important one may just be the architecture's focus on transformer models, which have become the machine learning technique de rigueur for many use cases and which powers models like GPT-3 and asBERT. The new Transformer Engine in the H100 chip promises to speed up model training by up to six times and because this new architecture also features Nvidia's new NVLink Switch system for connecting multiple nodes, large server clusters powered by these chips will be able to scale up to support massive networks with less overhead. "The largest AI models can require months to train on today's computing platforms," Nvidia's Dave Salvator writes in today's announcement. AI, high performance computing and data analytics are growing in complexity with some models, like large language ones, reaching trillions of parameters.
Nvidia takes the wraps off Hopper, its latest GPU architecture
Did you miss a session at the Data Summit? After much speculation, Nvidia today at its March 2022 GTC event announced the Hopper GPU architecture, a line of graphics cards that the company says will accelerate the types of algorithms commonly used in data science. Named for Grace Hopper, the pioneering U.S. computer scientist, the new architecture succeeds Nvidia's Ampere architecture, which launched roughly two years ago. The first card in the Hopper lineup is the H100, containing 80 billion transistors and a component called the Transformer Engine that's designed to speed up specific categories of AI models. Another architectural highlight includes Nvidia's MIG technology, which allows an H100 to be partitioned into seven smaller, isolated instances to handle different types of jobs.
Nvidia Offers The Ultimate AI Learning Tool With Jetson Nano 2GB
The Nano 2GB is connected and the Nano 4GB is ... [ ] in the background. Nvidia has asserted its leadership in Artificial Intelligence (AI) with a GPU architecture that continues to evolve with the growing demands of both training and inferencing AI workloads. The latest Ampere architecture provided a huge jump in performance with a new architecture that also allows the GPU to be partitioned to act as seven individual inference engines. As a result of Ampere, Nvidia's own supercomputer Selene based on the DGX A100 server ranks fifth in the TOP500 supercomputers and number one in the Green500 supercomputers. However, Nvidia is focused on more than just extreme computing as demonstrated by its proposed acquisition of Arm. Even without acquiring Arm, Nvidia has been pushing the boundaries of the AI down to lower-power and small form factor applications.
- Information Technology > Hardware (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)