AITopics | tensor core

tensor core

8d35d225230a9d77b29c1dd300e48ad9-Paper-Conference.pdf

Neural Information Processing Systems, Feb-16-2026, 13:18:27 GMT

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.68)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Communications (0.93)

Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References

Chen, Hongzheng, Fan, Bin, Collins, Alexander, Hagedorn, Bastian, Gaburov, Evghenii, Masuda, Masahiro, Brookhart, Matthew, Sullivan, Chris, Knight, Jason, Zhang, Zhiru, Grover, Vinod

arXiv.org Artificial Intelligence, Dec-11-2025

Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines, a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1× speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2× speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.
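The producer-consumer split described in this abstract can be illustrated with a loose CPU-side analogy. The sketch below is not Tawa, aref, or GPU code: it only shows one thread acting as a "load" role and another as a "compute" role, communicating through a bounded buffer. The names `NUM_STAGES`, `producer`, `consumer`, and `run_pipeline` are hypothetical and chosen for illustration.

```python
import queue
import threading

NUM_STAGES = 2   # hypothetical pipeline depth: number of buffer slots
SENTINEL = None  # marks the end of the tile stream


def producer(tiles, buf):
    """Load role: pushes tiles into the bounded buffer, blocking when it is full."""
    for tile in tiles:
        buf.put(tile)
    buf.put(SENTINEL)


def consumer(buf, results):
    """Compute role: pops tiles in order and reduces each one (stand-in for real work)."""
    while True:
        tile = buf.get()
        if tile is SENTINEL:
            break
        results.append(sum(tile))


def run_pipeline(tiles):
    """Run one producer and one consumer concurrently over a stream of tiles."""
    buf = queue.Queue(maxsize=NUM_STAGES)
    results = []
    t_prod = threading.Thread(target=producer, args=(tiles, buf))
    t_cons = threading.Thread(target=consumer, args=(buf, results))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5, 6]]))
```

The bounded queue plays the part of a multi-stage buffer: the producer stalls when all slots are occupied, and the consumer stalls when none are ready, so loading and computing can overlap without either side running ahead unboundedly.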

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.14719

Genre: Research Report (0.40)

Industry: Information Technology (0.51)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

MMA-Sim: Bit-Accurate Reference Model of Tensor Cores and Matrix Cores

Xie, Peichen, Wang, Yang, Yang, Fan, Yang, Mao

arXiv.org Artificial Intelligence, Nov-17-2025

The rapidly growing computation demands of deep neural networks (DNNs) have driven hardware vendors to integrate matrix multiplication accelerators (MMAs), such as NVIDIA Tensor Cores and AMD Matrix Cores, into modern GPUs. However, due to distinct and undocumented arithmetic specifications for floating-point matrix multiplication, some MMAs can lead to numerical imprecision and inconsistency that can compromise the stability and reproducibility of DNN training and inference. This paper presents MMA-Sim, the first bit-accurate reference model that reveals the detailed arithmetic behaviors of the MMAs from ten GPU architectures (eight from NVIDIA and two from AMD). By dissecting the MMAs using a combination of targeted and randomized tests, our methodology derives nine arithmetic algorithms to simulate the floating-point matrix multiplication of the MMAs. Large-scale validation confirms bitwise equivalence between MMA-Sim and the real hardware. Using MMA-Sim, we investigate arithmetic behaviors that affect DNN training stability, and identify undocumented behaviors that could lead to significant errors.
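One reason accumulation behavior matters for reproducibility is that floating-point addition is not associative, so two units that sum the same products in a different order can return different results. The sketch below illustrates only that general point in IEEE 754 double precision. It does not reproduce any of the arithmetic algorithms derived by MMA-Sim, and the function names are hypothetical.

```python
def dot_left_to_right(a, b):
    """Accumulate products sequentially into a single running sum."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_pairwise(a, b):
    """Accumulate products with a balanced pairwise (tree) reduction."""
    products = [x * y for x, y in zip(a, b)]
    if not products:
        return 0.0
    while len(products) > 1:
        nxt = [products[i] + products[i + 1] for i in range(0, len(products) - 1, 2)]
        if len(products) % 2:
            nxt.append(products[-1])  # carry the unpaired element to the next level
        products = nxt
    return products[0]


if __name__ == "__main__":
    a = [1e16, 1.0, -1e16, 1.0]
    b = [1.0, 1.0, 1.0, 1.0]
    # The exact answer is 2.0; each order loses different low-order bits.
    print(dot_left_to_right(a, b), dot_pairwise(a, b))
```

With these inputs the sequential order returns 1.0 and the pairwise order returns 0.0, while the exact dot product is 2.0. Neither is "the" hardware result; the point is only that the summation order is part of the arithmetic specification.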

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.10909

Genre: Research Report (1.00)

Industry: Information Technology (0.57)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis

Kao, Jyun-Ping

arXiv.org Artificial Intelligence, Nov-11-2025

Large language models (LLMs) are rapidly being applied to radiology, enabling automated image interpretation and report generation tasks. Their deployment in clinical practice requires both high diagnostic accuracy and low inference latency, which in turn demands powerful hardware. High-performance graphics processing units (GPUs) provide the necessary compute and memory throughput to run large LLMs on imaging data. We review modern GPU architectures (e.g., NVIDIA A100/H100, AMD Instinct MI250X/MI300) and key performance metrics: floating-point throughput, memory bandwidth, and VRAM capacity. We show how these hardware capabilities affect radiology tasks: for example, generating reports or detecting findings on CheXpert and MIMIC-CXR images is computationally intensive and benefits from GPU parallelism and tensor-core acceleration. Empirical studies indicate that using appropriate GPU resources can reduce inference time and improve throughput. We discuss practical challenges including privacy, deployment, cost, and power, along with optimization strategies: mixed-precision, quantization, compression, and multi-GPU scaling. Finally, we anticipate that next-generation features (8-bit tensor cores, enhanced interconnect) will further enable on-premise and federated radiology AI. Advancing GPU infrastructure is essential for safe, efficient LLM-based radiology diagnostics.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.16328

Country: Asia > Taiwan (0.05)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

8d35d225230a9d77b29c1dd300e48ad9-Paper-Conference.pdf

Neural Information Processing Systems, Oct-10-2025, 09:08:19 GMT

comera, contraction, neural network, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.68)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Communications (0.93)

Toward Lifelong-Sustainable Electronic-Photonic AI Systems via Extreme Efficiency, Reconfigurability, and Robustness

Yin, Ziang, Zhou, Hongjian, Sudarshan, Chetan Choppali, Chhabria, Vidya, Gu, Jiaqi

arXiv.org Artificial Intelligence, Sep-10-2025

The relentless growth of large-scale artificial intelligence (AI) has created unprecedented demand for computational power, straining the energy, bandwidth, and scaling limits of conventional electronic platforms. Electronic-photonic integrated circuits (EPICs) have emerged as a compelling platform for next-generation AI systems, offering inherent advantages in ultra-high bandwidth, low latency, and energy efficiency for computing and interconnection. Beyond performance, EPICs also hold unique promises for sustainability. Fabricated in relaxed process nodes with fewer metal layers and lower defect densities, photonic devices naturally reduce embodied carbon footprint (CFP) compared to advanced digital electronic integrated circuits, while delivering orders-of-magnitude higher computing performance and interconnect bandwidth. To further advance the sustainability of photonic AI systems, we explore how electronic-photonic design automation (EPDA) and cross-layer co-design methodologies can amplify these inherent benefits. We present how advanced EPDA tools enable more compact layout generation, reducing both chip area and metal layer usage. We will also demonstrate how cross-layer device-circuit-architecture co-design unlocks new sustainability gains for photonic hardware: ultra-compact photonic circuit designs that minimize chip area cost, reconfigurable hardware topology that adapts to evolving AI workloads, and intelligent resilience mechanisms that prolong lifetime by tolerating variations and faults. By uniting intrinsic photonic efficiency with EPDA- and co-design-driven gains in area efficiency, reconfigurability, and robustness, we outline a vision for lifelong-sustainable electronic-photonic AI systems. This perspective highlights how EPIC AI systems can simultaneously meet the performance demands of modern AI and the urgent imperative for sustainable computing.

accelerator, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2509.07396

Country: North America > United States (0.68)

Genre: Research Report (0.64)

Industry:

Energy (1.00)
Information Technology (0.95)
Semiconductors & Electronics (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving

Hu, Huanqi, Xiao, Bowen, Sun, Shixuan, Yin, Jianian, Zhang, Zhexi, Luo, Xiang, Jiang, Chengquan, Xu, Weiqi, Jia, Xiaoying, Liu, Xin, Guo, Minyi

arXiv.org Artificial Intelligence, Sep-3-2025

Quantization is a critical technique for accelerating LLM inference by reducing memory footprint and improving computational efficiency. Among various schemes, 4-bit weight and 8-bit activation quantization (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 GEMM kernels fall short in practice due to inefficient dequantization on CUDA Cores, which cannot keep pace with the high throughput of Tensor Cores. In this paper, we present LiquidGEMM, a hardware-efficient W4A8 GEMM kernel for LLM serving. LiquidGEMM introduces two key techniques: LiquidQuant, a hardware-efficient quantization method that enables fast, overflow-safe dequantization using just two arithmetic instructions per four elements; and an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and MMA across warp groups without software synchronization or redundant memory traffic. Experimental results show that LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup. Compared to various quantized GEMM kernels in NVIDIA TensorRT-LLM, LiquidGEMM delivers 1.12-1.63x performance gains and achieves up to 1.63x system-level speedup.
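To make the W4A8 data path concrete, here is a minimal sketch of a generic scheme: weights are stored as 4-bit codes packed two per byte, expanded to small integers at compute time, multiplied against 8-bit integer activations with integer accumulation, and rescaled at the end. This is not LiquidQuant and makes no claim about its instruction count or overflow handling. The zero point of 8, the low-nibble-first layout, and all function names are assumptions made for illustration.

```python
def quantize_w4(weights, scale):
    """Map floats to unsigned 4-bit codes in [0, 15], assuming a zero point of 8."""
    return [max(0, min(15, round(w / scale) + 8)) for w in weights]


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (assumed layout)."""
    if len(codes) % 2:
        raise ValueError("expected an even number of 4-bit codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def w4a8_dot(packed_w, acts_i8, w_scale, a_scale):
    """Dot product of packed 4-bit weights with 8-bit integer activations."""
    acc = 0  # Python int, standing in for a wide integer accumulator
    i = 0
    for byte in packed_w:
        for code in (byte & 0x0F, byte >> 4):
            acc += (code - 8) * acts_i8[i]  # dequantize the code to a signed integer
            i += 1
    return acc * w_scale * a_scale  # apply both scales once, after accumulation


if __name__ == "__main__":
    packed = pack_nibbles(quantize_w4([-1.0, 0.5, 0.0, 1.5], scale=0.5))
    print(packed.hex(), w4a8_dot(packed, [1, 2, 3, 4], w_scale=0.5, a_scale=1.0))
```

Keeping the accumulation in integers and applying the scales once at the end mirrors the general reason W4A8 is attractive: the inner loop maps onto integer matrix units, and the per-element unpacking step is the part the abstract identifies as the bottleneck.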

large language model, machine learning, quantization, (18 more...)

arXiv.org Artificial Intelligence

2509.01229

Country:

Asia > China > Shanghai > Shanghai (0.05)
North America > United States > Washington > King County > Seattle (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Make the most of AI with iBUYPOWER's RDY Element Pro R01

PCWorld, Jul-24-2025, 21:03:44 GMT

AI is driving new capabilities across creative apps, productivity tools and gaming. Many of these features can now run directly on a laptop or desktop without the need for subscriptions or cloud access. PCs equipped with a GeForce RTX 50 Series GPU accelerate AI performance and unlock the best experience. Alongside DLSS 4 upscaling and multi frame generation, which boost performance and image quality in supported game titles, GeForce RTX 50 Series GPUs give you access to NVIDIA's suite of RTX AI features that can enhance your creative endeavors, give you free access to a capable (and private) chatbot, and help you enjoy streamed movies and TV at higher quality than ever before. NVIDIA has been at the forefront of AI innovation, from building AI factories to advancing the latest gaming and creative technologies on laptops and desktops. Tensor Cores, the dedicated hardware that accelerates AI processing, first launched with the GeForce RTX 20 Series in 2018 and have advanced every generation since then.

artificial intelligence, geforce rtx 50, natural language, (9 more...)

PCWorld

Technology: Information Technology > Artificial Intelligence > Natural Language (0.37)

TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations

Sedukhin, Stanislav, Tomioka, Yoichi, Matsumoto, Kazuya, Okuyama, Yuichi

arXiv.org Artificial Intelligence, Jul-1-2025

Multilinear transformations are key in high-performance computing (HPC) and artificial intelligence (AI) workloads, where data is represented as tensors. However, their high computational and memory demands, which grow with dimensionality, often slow down critical tasks. Moreover, scaling computation by enlarging the number of parallel processing units substantially increases energy consumption, limiting widespread adoption, especially for sparse data, which is common in HPC and AI applications. This paper introduces the Trilinear Algorithm and Device Architecture (TriADA), in which the device architecture is isomorphic to the algorithm, to address these challenges with the following innovations: (1) a massively parallel, low-rank algorithm for computing a family of trilinear (3D) discrete orthogonal transformations (3D-DXTs), which is a special case of the more general 3-mode matrix-by-tensor multiplication (3D-GEMT); (2) a new outer-product-based GEMM kernel with decoupled streaming active memory, specially designed to accelerate the 3D-GEMT operation; (3) a fully distributed 3D network of mesh-interconnected processing elements, or cells, that is isomorphic to the proposed algorithm and has coordinate-free, data-driven local processing activity independent of problem size; (4) an elastic sparse outer-product (ESOP) method that avoids unnecessary computing and communication operations with zero-valued operands, thereby enhancing energy efficiency, computational accuracy, and stability. TriADA is capable of performing a variety of trilinear transformations with hypercubic arithmetic complexity in a linear number of time-steps. The massively parallel, scalable, and energy-efficient architecture of TriADA is ideal for accelerating multilinear tensor operations, which are the most demanding parts of AI and HPC workloads.
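The outer-product formulation of matrix multiplication mentioned in item (2), combined with the zero-skipping idea in item (4), can be sketched in a few lines. The code below is a plain sequential illustration of the general technique, computing C as a sum of rank-1 updates and skipping multiply-adds whose operand is zero. It is not the TriADA kernel or the ESOP method, and the function name and the `skipped` counter are hypothetical.

```python
def outer_product_gemm(A, B):
    """Compute C = A @ B as a sum of outer products, skipping zero operands.

    Returns (C, skipped), where skipped counts the multiply-adds avoided.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    if any(len(row) != k_dim for row in A):
        raise ValueError("inner dimensions of A and B do not match")
    C = [[0.0] * m for _ in range(n)]
    skipped = 0
    for k in range(k_dim):          # one rank-1 update per step k
        for i in range(n):
            a = A[i][k]
            if a == 0:
                skipped += m        # the whole row of this update is zero
                continue
            for j in range(m):
                b = B[k][j]
                if b == 0:
                    skipped += 1
                    continue
                C[i][j] += a * b
    return C, skipped


if __name__ == "__main__":
    C, skipped = outer_product_gemm([[1, 0], [2, 3]], [[4, 5], [0, 6]])
    print(C, skipped)
```

In the outer-product order, step k touches only column k of A and row k of B, which is why this form suits streaming operands through a grid of cells; the skip count gives a rough sense of how much work sparsity removes.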

artificial intelligence, data quality, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2506.22818

Country:

Africa > Senegal > Kolda Region > Kolda (0.04)
Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
North America > Canada > Ontario > Toronto (0.04)
(14 more...)

Genre: Research Report (0.41)

Industry: Information Technology (1.00)

Technology:

Information Technology > Scientific Computing (1.00)
Information Technology > Data Science > Data Quality > Data Transformation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Fused3S: Fast Sparse Attention on Tensor Cores

Li, Zitong, Chandramowlishwaran, Aparna

arXiv.org Artificial Intelligence, May-14-2025

Sparse attention is a core building block in many leading neural network models, from graph-structured learning to sparse sequence modeling. It can be decomposed into a sequence of three sparse matrix operations (3S): sampled dense-dense matrix multiplication (SDDMM), softmax normalization, and sparse matrix multiplication (SpMM). Efficiently executing the 3S computational pattern on modern GPUs remains challenging due to (a) the mismatch between unstructured sparsity and tensor cores optimized for dense operations, and (b) the high cost of data movement. Previous works have optimized these sparse operations individually or addressed one of these challenges. This paper introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor core utilization and minimizes data movement. Across real-world graph datasets, Fused3S achieves 1.6-16.3× and 1.5-14× speedup over state-of-the-art on H100 and A30 GPUs. Furthermore, integrating Fused3S into Graph Transformer inference accelerates end-to-end performance by 1.05-5.36×, consistently outperforming all 3S baselines across diverse datasets (single and batched graphs) and GPU architectures.
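The 3S decomposition named in this abstract can be written out directly. The sketch below is a reference-style pure Python version that performs SDDMM, softmax, and SpMM for each row in a single pass, so no intermediate score matrix is materialized. It says nothing about tensor core tiling or the actual Fused3S kernel. The adjacency-list input `neighbors`, the omission of any score scaling, and the zero output for rows without neighbors are assumptions made for illustration.

```python
import math


def sparse_attention(Q, K, V, neighbors):
    """Row-wise sparse attention; neighbors[i] lists the key indices row i attends to."""
    d_out = len(V[0])
    out = []
    for i, cols in enumerate(neighbors):
        if not cols:
            out.append([0.0] * d_out)  # assumed convention for rows with no neighbors
            continue
        # SDDMM: compute Q[i] . K[j] only at the sampled (nonzero) positions.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) for j in cols]
        # Softmax over the sampled scores, shifted by the max for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # SpMM: weighted sum of the selected rows of V.
        out.append([sum(p * V[j][c] for p, j in zip(probs, cols)) for c in range(d_out)])
    return out


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[0.0, 0.0], [0.0, 0.0]]
    V = [[2.0, 0.0], [4.0, 2.0]]
    print(sparse_attention(Q, K, V, neighbors=[[0, 1], [1]]))
```

Doing all three steps per row keeps the scores and probabilities in local variables, which is the data-movement argument for fusion in miniature: the unfused alternative would write a sparse score matrix out after SDDMM and read it back twice.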

artificial intelligence, machine learning, thread block, (17 more...)

arXiv.org Artificial Intelligence

2505.08098

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.14)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.06)
North America > United States > New York > New York County > New York City (0.04)
(7 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
