Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity ($\mathcal{O}(n^3)$ for $n\times n$ matrices). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75\% memory savings and $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on the RTX 4090 demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.
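The core trick, approximating one operand by a truncated factorization so the product costs $\mathcal{O}(n^2 r)$ instead of $\mathcal{O}(n^3)$, can be sketched in a few lines of NumPy. This is a simplification: the paper's system also selects randomized SVD and FP8 kernels, which are omitted here, and the function name is ours.

```python
import numpy as np

def low_rank_gemm(A, B, rank):
    # Truncated SVD of A: A ≈ U_r diag(s_r) V_r^T
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ur = U[:, :rank] * s[:rank]          # fold singular values into U
    Vr = Vt[:rank, :]
    # Ur @ (Vr @ B) costs O(n^2 r) instead of O(n^3) for the exact product
    return Ur @ (Vr @ B)

rng = np.random.default_rng(0)
# test matrix with exact rank 16, so truncation at rank 16 is lossless
A = rng.standard_normal((256, 16)) @ rng.standard_normal((16, 256))
B = rng.standard_normal((256, 256))
C = low_rank_gemm(A, B, rank=16)
print(np.allclose(C, A @ B))  # True: exact product recovered when rank covers A
```

For genuinely full-rank matrices the truncation introduces an approximation error, which is the accuracy/throughput trade-off the abstract's benchmarks quantify.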
QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention
Oh, Hyunwoo, Chen, Hanning, Yun, Sanggeon, Ni, Yang, Huang, Wenjun, Das, Tamoghno, Jang, Suyeon, Imani, Mohsen
Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer--forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W'_m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within <=0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality--and locality into utilization--QUILL delivers consistent, end-to-end speedups.
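The abstract does not spell out DOOQ's ordering rule, but the proximity idea can be illustrated with a greedy nearest-neighbor schedule in plain Python. The function name and the choice of starting query are illustrative assumptions, not QUILL's RTL.

```python
import numpy as np

def dooq_order(ref_points):
    """Greedy nearest-neighbor ordering of query reference points.

    A software sketch of DOOQ's idea: visiting spatially close queries
    back-to-back keeps their sampled feature-map regions cache-resident.
    """
    n = len(ref_points)
    visited = np.zeros(n, dtype=bool)
    order = [0]               # start from the first query (arbitrary choice)
    visited[0] = True
    for _ in range(n - 1):
        last = ref_points[order[-1]]
        d = np.linalg.norm(ref_points - last, axis=1)
        d[visited] = np.inf   # exclude already-scheduled queries
        nxt = int(np.argmin(d))
        order.append(nxt)
        visited[nxt] = True
    return order

# four queries scattered over a 2D feature map: two near the origin, two far away
pts = np.array([[0.0, 0.0], [9.0, 9.0], [0.1, 0.1], [9.1, 9.0]])
print(dooq_order(pts))  # [0, 2, 1, 3]: spatial neighbors are scheduled adjacently
```

In the accelerator, this ordering is what makes the look-ahead prefetch productive: the next query's sampling region is likely already in the alternate buffer.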
Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
Hong, Ke, Li, Xiuhong, Liu, Minxu, Mao, Qiuli, Wu, Tianqi, Huang, Zixiao, Chen, Lufang, Wang, Zhong, Zhang, Yichong, Zhu, Zhenhua, Dai, Guohao, Wang, Yu
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency becomes an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, which utilizes a novel signaling mechanism: when part of the output finishes, the computation kernel sends a signal to trigger the communication of that part, while continuing the computation of the remaining part (interference-free computation). Consequently, the communication of the finished part and the computation of the remaining part can be overlapped. On top of the signaling mechanism, FlashOverlap comprises two key components: (1) the determination of the signaling timing to boost the overlap efficiency (tile-wise overlapping), and (2) a pre-communication reordering to create the contiguous address for finished data, enabling communication by simply calling NCCL APIs (communication agnosticism), and a post-communication reordering to correct the data order. Experiments show that FlashOverlap achieves up to 1.65x speedup through overlap, outperforming existing works in most cases. Code is available at https://github.com/infinigence/FlashOverlap.
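A toy CPU analogue of the signaling mechanism, with a queue standing in for the signal and a consumer thread standing in for the NCCL call, might look like the following. Names and the row-block tiling are illustrative; the real design signals from inside the GEMM kernel and reorders data for contiguous transfers.

```python
import threading
import queue
import numpy as np

def overlapped_gemm_allreduce(A, B, n_tiles=4):
    """Sketch of FlashOverlap-style signaling: as each output tile of a GEMM
    finishes, a signal triggers its 'communication' (a consumer thread here),
    which overlaps with computation of the remaining tiles."""
    done = queue.Queue()                      # the signaling channel
    tiles = np.array_split(np.arange(A.shape[0]), n_tiles)
    out = np.empty((A.shape[0], B.shape[1]))
    sent = []

    def communicate():                        # stands in for an NCCL call
        for _ in range(n_tiles):
            idx = done.get()                  # wait for a tile-finished signal
            sent.append(idx)                  # 'transfer' that finished tile

    comm = threading.Thread(target=communicate)
    comm.start()
    for t, rows in enumerate(tiles):
        out[rows] = A[rows] @ B               # compute one output tile
        done.put(t)                           # signal: tile t is ready to send
    comm.join()
    return out, sent

A = np.random.default_rng(1).standard_normal((64, 32))
B = np.random.default_rng(2).standard_normal((32, 16))
out, sent = overlapped_gemm_allreduce(A, B)
print(np.allclose(out, A @ B), sorted(sent) == [0, 1, 2, 3])  # True True
```

The computation loop never waits on communication, which mirrors the paper's "interference-free computation" requirement: signaling is a cheap enqueue, not a synchronization with the transfer.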
Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training
Liu, Zeyu, Li, Yan, Zhang, Yunquan, Zhang, Boyang, Jiang, Guoyong, Zhang, Xin, Xiao, Limin, Zhang, Weifeng, Cheng, Daning
Training large language models typically demands extensive GPU memory and substantial financial investment, which poses a barrier for many small- to medium-sized teams. In this paper, we propose a full-parameter pre-training and fine-tuning framework based on block coordinate descent (BCD), enhanced with engineering optimizations, to enable efficient training of large-scale models on cost-effective RTX 4090, A100 and A800 GPU clusters. Under identical hardware configurations, we reduce the training cost of a 7B model to 33% on A100/A800 and only 2.6% on RTX 4090, compared to standard full-parameter training. It also enables large models previously restricted to A100 clusters to be trained on RTX 4090 without degrading performance. BCD achieves comparable or better accuracy than full-parameter and fine-tuning methods in most cases, with lower GPU consumption and improved hardware utilization.
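As a minimal illustration of the BCD idea, here is cyclic block coordinate descent on a toy least-squares problem: each step updates one parameter block while the rest stay frozen, so only that block's gradient state is live, which is the source of the memory savings. The toy objective and hyperparameters are ours, not the paper's.

```python
import numpy as np

def bcd_least_squares(X, y, n_blocks=4, sweeps=300, lr=0.1):
    """Cyclic block coordinate descent on 0.5*||Xw - y||^2 / n.

    Only one block's gradient is computed per step; in LLM training the
    analogous move is optimizing one group of layers while others are frozen."""
    w = np.zeros(X.shape[1])
    blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    for _ in range(sweeps):
        for blk in blocks:                       # cycle through parameter blocks
            grad = X[:, blk].T @ (X @ w - y) / len(y)
            w[blk] -= lr * grad                  # update only this block
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
w_true = rng.standard_normal(8)
y = X @ w_true                                   # noiseless targets
w = bcd_least_squares(X, y)
print(np.allclose(w, w_true, atol=1e-2))  # True: cyclic updates converge
```

The same structure explains the cost numbers in the abstract: optimizer state and activations for frozen blocks need not be resident, so a 24 GB RTX 4090 can host blocks of a model whose full-parameter state would not fit.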
LatPhon: Lightweight Multilingual G2P for Romance Languages and English
Chary, Luis Felipe, Ramirez, Miguel Arjona
Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, especially across multiple Latin-script languages. We present LatPhon, a 7.5M-parameter Transformer jointly trained on six such languages--English, Spanish, French, Italian, Portuguese, and Romanian. On the public ipa-dict corpus, it attains a mean phoneme error rate (PER) of 3.5%, outperforming the byte-level ByT5 baseline (5.4%) and approaching language-specific WFSTs (3.2%) while occupying 30 MB of memory, which makes on-device deployment feasible when needed. These results indicate that compact multilingual G2P can serve as a universal front-end for Latin-script speech pipelines.
APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
Ma, Shaobo, Fang, Chao, Shao, Haikuo, Wang, Zhongfeng
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs, however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.
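The bit-level dismantling step can be sketched for unsigned integer weights: split W into binary bit-planes, multiply each plane independently, and reassemble the results with powers of two. This simplification deliberately ignores bipolar-INT's signed encoding and the Tensor Core mapping; it only shows why arbitrary precision falls out of the decomposition.

```python
import numpy as np

def bitplane_matmul(A, W, bits=4):
    """Bit-level MatMul sketch in the spirit of APT-LLM: an unsigned `bits`-bit
    weight matrix W is dismantled into binary bit-planes, each plane is
    multiplied on its own, and partial products are reassembled as
    sum_i 2^i * (A @ W_i). On hardware, each binary-plane product maps onto
    1-bit Tensor Core operations, so `bits` can be any integer."""
    acc = np.zeros((A.shape[0], W.shape[1]), dtype=np.int64)
    for i in range(bits):
        plane = (W >> i) & 1              # extract the i-th bit of every weight
        acc += (A @ plane) << i           # weight the plane's product by 2^i
    return acc

rng = np.random.default_rng(0)
A = rng.integers(0, 16, size=(5, 7), dtype=np.int64)
W = rng.integers(0, 16, size=(7, 3), dtype=np.int64)  # 4-bit unsigned weights
print(np.array_equal(bitplane_matmul(A, W), A @ W))   # True: exact reconstruction
```

Because the loop runs once per weight bit, a 3-bit model issues three binary-plane products instead of four, which is where the precision/throughput flexibility claimed in the abstract comes from.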
TOP: Time Optimization Policy for Stable and Accurate Standing Manipulation with Humanoid Robots
Chen, Zhenghan, Xu, Haocheng, Zhang, Haodong, Zhang, Liang, Li, He, Wang, Dongqi, Yu, Jiyu, Yang, Yifei, Zhou, Zhongxiang, Xiong, Rong
Humanoid robots have the potential to perform a diverse range of manipulation tasks, but this depends on a robust and precise standing controller. Existing methods are either ill-suited to precisely control high-dimensional upper-body joints, or struggle to ensure both robustness and accuracy, especially when upper-body motions are fast. This paper proposes a novel time optimization policy (TOP) to train a standing manipulation control model that ensures balance, precision, and time efficiency simultaneously, with the idea of adjusting the time trajectory of upper-body motions rather than only strengthening the disturbance resistance of the lower body. Our approach consists of three parts. First, we utilize a motion prior to represent upper-body motions, training a variational autoencoder (VAE) to enhance coordination between the upper and lower body. Then we decouple whole-body control into an upper-body PD controller for precision and a lower-body RL controller for robust stability. Finally, we train the TOP method in conjunction with the decoupled controller and VAE to reduce the balance burden of fast upper-body motions that would destabilize the robot and exceed the capabilities of the lower-body RL policy. The effectiveness of the proposed approach is evaluated via both simulation and real-world experiments, which demonstrate superior stability and accuracy on standing manipulation tasks. The project page can be found at https://anonymous.4open.science/w/top-258F/.

I. INTRODUCTION

Humanoid robots are among the most promising embodied agents for liberating human-level labor, as they are designed to perform anthropomorphic motions and various whole-body loco-manipulation tasks, including industrial parts assembly, home service, etc. [1].
Their anthropomorphism naturally makes them more suitable than other task-specific robots to interact with environments, objects and humans to complete various physical tasks. Although rapid growth has been achieved in the field of humanoid robots [2], it remains a challenge to execute various intricate tasks while maintaining balance and precision simultaneously, due to the intrinsic instability of humanoid robots. Existing methods can be broadly divided into two paradigms: whole-body controllers [3, 4, 5] and upper- and lower-body decoupled controllers [6, 7]. Rong Xiong is the corresponding author.
MSI Raider 18 HX AI review: A benchmark-breaking beast in laptop form
The MSI Raider 18 HX AI isn't a looker, but it packs incredible CPU and GPU performance. The MSI Raider 18 HX AI is the very model of a "desktop replacement" laptop. It's big, it's not much to look at, and it has a mediocre touchpad that implies users are really expected to connect a mouse. That might leave some shoppers asking, "What's the point?" That question is answered once the laptop is tossed into a demanding game or application. It might be thick, but the MSI Raider 18 HX delivers top-tier CPU and GPU performance. It even has gobs of RAM and a PCIe 5.0 solid state drive.
A Nonlinear Hash-based Optimization Method for SpMV on GPUs
Yan, Chen, Diao, Boyu, Liu, Hangda, An, Zhulin, Xu, Yongjun
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China. {yanchen23s, diaoboyu2012, liuhangda21s, anzhulin, xyj}@ict.ac.cn

Abstract -- Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and sparsity of sparse matrices often make it a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering, introducing the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation's ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both Nvidia Jetson AGX Orin and Nvidia RTX 4090, show that in the pre-processing step, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on Orin and 3.01 times on RTX 4090 against the CSR format on sparse matrices from the University of Florida Sparse Matrix Collection.

INTRODUCTION

Sparse matrix-vector multiplication (SpMV) has a wide range of applications, such as mathematical solutions for sparse linear equations [13], iterative algorithm solving [15][25], graph processing [9][14][24], and weight calculations for forward and backward propagation in neural networks [3][12][17][19]. However, SpMV is often the bottleneck of these algorithms. The sparse matrix used in SpMV has the following characteristics [4]: (1) Sparsity. On the one hand, sparse matrices contain a large number of zero elements.
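For reference, the CSR baseline that HBP is measured against computes each output row from a compressed slice of nonzeros; a minimal NumPy version of the kernel:

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """CSR sparse matrix-vector product, the baseline format above.

    Row i's nonzero values live in data[indptr[i]:indptr[i+1]], with their
    column positions in the matching slice of indices."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        start, end = indptr[i], indptr[i + 1]
        y[i] = data[start:end] @ x[indices[start:end]]
    return y

# 3x4 sparse matrix: [[1,0,2,0],[0,0,0,3],[4,5,0,0]]
indptr  = np.array([0, 2, 3, 5])     # row i spans [indptr[i], indptr[i+1])
indices = np.array([0, 2, 3, 0, 1])  # column index of each nonzero
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.array([1.0, 1.0, 1.0, 1.0])
print(csr_spmv(indptr, indices, data, x))  # [3. 3. 9.]
```

The gather `x[indices[...]]` is the irregular memory access that reordering schemes such as HBP try to make more local, and the per-row nonzero counts are what load-balancing methods equalize across GPU threads.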
GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Zhao, Pengxiang, Yuan, Xiaoming
Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57$\times$ speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
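The lookup-table idea behind hardware-efficient mpGEMM can be sketched as follows: low-bit codes index per-row codebooks of non-uniform centroids, and dequantization becomes a table gather before the multiply. The shapes and the row-wise codebook granularity are illustrative assumptions, not GANQ's exact layout.

```python
import numpy as np

def lut_dequant_matmul(codes, luts, x):
    """Lookup-table mpGEMM sketch in the spirit of GANQ: each weight is stored
    as a low-bit index into its row's codebook (LUT) of non-uniform
    floating-point centroids; dequantization is a gather, then a dense GEMV."""
    W = np.take_along_axis(luts, codes, axis=1)  # row-wise LUT gather -> FP weights
    return W @ x

rng = np.random.default_rng(0)
rows, cols, bits = 4, 16, 3
luts = np.sort(rng.standard_normal((rows, 2**bits)), axis=1)  # 8 centroids per row
codes = rng.integers(0, 2**bits, size=(rows, cols))           # 3-bit weight indices
x = rng.standard_normal(cols)
y = lut_dequant_matmul(codes, luts, x)
print(y.shape)  # (4,)
```

Because the codebook entries are unconstrained, the centroids can follow the true weight distribution instead of a uniform grid, which is the accuracy argument the abstract makes against uniform quantization; the memory win comes from storing 3-bit codes plus a small LUT rather than FP16 weights.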