GEMM
PrivCirNet: Efficient Private Inference via Block Circulant Transformation
Homomorphic encryption (HE)-based deep neural network (DNN) inference protects data and model privacy but suffers from significant computation overhead. We observe that transforming the DNN weights into circulant matrices converts general matrix-vector multiplications into HE-friendly 1-dimensional convolutions, drastically reducing the HE computation cost.
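The algebraic identity behind this claim is that a circulant matrix is diagonalized by the DFT, so a circulant matrix-vector product is exactly a 1-D circular convolution, computable in $\mathcal{O}(n \log n)$. Below is a minimal NumPy sketch of that identity in plaintext; it does not model the HE protocol itself, and the helper name `circulant_matvec_fft` is hypothetical, not from the paper.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix whose first column is c by vector x.

    Since circulant matrices are diagonalized by the DFT, C @ x equals
    the circular convolution of c and x: ifft(fft(c) * fft(x)).
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Sanity check against an explicitly constructed circulant matrix.
rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)
x = rng.standard_normal(n)

C = np.empty((n, n))
for j in range(n):
    C[:, j] = np.roll(c, j)  # column j is c cyclically shifted down by j

assert np.allclose(C @ x, circulant_matvec_fft(c, x))
```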
Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for $n\times n$ matrices). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-cubic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75\% memory savings and a $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.
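The core trick is that factoring $A \approx UV$ with rank $r$ turns the $\mathcal{O}(mkn)$ product $AB$ into two thin GEMMs, $U(VB)$, costing $\mathcal{O}(rn(m+k))$. A minimal NumPy sketch under simplifying assumptions: it uses an exact truncated SVD in FP64 rather than the paper's randomized-SVD and FP8 paths, and `lowrank_gemm` is a hypothetical helper name, not the paper's API.

```python
import numpy as np

def lowrank_gemm(A, B, rank):
    """Approximate A @ B via a rank-`rank` factorization of A.

    With A ~= U_r @ V_r (U_r: m x r, V_r: r x k), A @ B ~= U_r @ (V_r @ B).
    Costs: V_r @ B is O(r*k*n), U_r @ (.) is O(m*r*n) -- sub-cubic for small r.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r @ (V_r @ B)         # two thin GEMMs instead of one cubic GEMM

rng = np.random.default_rng(0)
m = k = n = 512
# A genuinely low-rank A, the regime where the approximation is accurate.
A = rng.standard_normal((m, 32)) @ rng.standard_normal((32, k))
B = rng.standard_normal((k, n))

approx = lowrank_gemm(A, B, rank=32)
rel_err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
print(f"relative error: {rel_err:.2e}")  # near machine precision here
```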
Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing
Khyati Kiyawat, Zhenxing Fan, Yasas Seneviratne, Morteza Baradaran, Akhil Shekar, Zihan Xia, Mingu Kang, Kevin Skadron
Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations, such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The logic chiplets sustain high-bandwidth access to the DRAM chiplets, which house the memory banks, and enable the integration of advanced processing components such as systolic arrays and SRAM-based buffers to accelerate memory-bound GEMM kernels, capabilities that were not feasible in prior PIM architectures. We propose Sangam, a CXL-attached, PIM-chiplet-based memory module that can either act as a drop-in replacement for GPUs or co-execute alongside them. Sangam achieves 3.93x, 4.22x, and 2.82x speedups in end-to-end query latency, 10.3x, 9.5x, and 6.36x greater decoding throughput, and order-of-magnitude energy savings compared to an H100 GPU for varying input sizes, output lengths, and batch sizes on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, respectively.
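The memory-bound premise follows from a roofline-style operational-intensity estimate. A back-of-envelope sketch (the numbers and the `gemv_oi` helper are illustrative assumptions, not figures from the paper):

```python
# Operational intensity (FLOPs per byte moved) of the decode-phase GEMV
# y = W @ x that dominates single-batch LLM decoding.
def gemv_oi(d_in, d_out, bytes_per_weight=2):
    """FLOPs: 2 * d_out * d_in (one multiply + one add per weight).
    Bytes: dominated by streaming the fp16 weight matrix W once from DRAM.
    """
    flops = 2 * d_out * d_in
    bytes_moved = d_out * d_in * bytes_per_weight
    return flops / bytes_moved

# For any layer size, fp16 GEMV lands at OI = 1 FLOP/byte -- roughly two
# orders of magnitude below the ~300 FLOPs/byte ridge point of an H100
# (~989 TFLOPS fp16 / ~3.35 TB/s HBM), hence decoding is memory-bound.
print(gemv_oi(4096, 4096))  # 1.0
```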
HipKittens: Fast and Furious AMD Kernels
William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora
AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that the tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, the algorithms that instantiate these abstractions need to be rethought for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.
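For readers unfamiliar with the tile abstraction these DSLs expose, the sketch below illustrates the decomposition in plain NumPy: each worker owns one output tile and accumulates over K in tile-sized chunks. This is only a conceptual stand-in; HK itself is a C++ embedded DSL whose tiles live in registers and shared memory with asynchronous loads, and `tiled_gemm` is a hypothetical name, not HK's API.

```python
import numpy as np

def tiled_gemm(A, B, T=16):
    """Tile-based GEMM: each (i, j) output tile is one worker's job."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and N % T == 0 and K % T == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            acc = np.zeros((T, T), dtype=A.dtype)  # per-tile accumulator
            for k in range(0, K, T):               # stream K-tiles through
                acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
            C[i:i+T, j:j+T] = acc
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
assert np.allclose(tiled_gemm(A, B), A @ B)
```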