rtx 3090
Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models
During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image.
6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data
Sinha, Saptarshi Neil, Kühn, Julius, Goschke, Mika Silvan, Weinmann, Michael
Automated and selective harvesting of fruits has become an important area of research, particularly due to challenges such as high costs and a shortage of seasonal labor in advanced economies. This paper focuses on 6D pose estimation of strawberries using purely synthetic data generated through a procedural pipeline for photorealistic rendering. W e employ the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone, known for its balance between speed and accuracy, and its support for edge inference. T o address the lacking availability of training data, we introduce a robust and flexible pipeline for generating synthetic strawberry data from various 3D models via a procedural Blender pipeline, where we focus on enhancing the realism of the synthesized data in comparison to previous work to make it a valuable resource for training pose estimation algorithms. Quantitative evaluations indicate that our models achieve comparable accuracy on both the NVIDIA RTX 3090 and Jetson Orin Nano across several ADD-S metrics, with the RTX 3090 demonstrating superior processing speed. However, the Jetson Orin Nano is particularly suited for resource-constrained environments, making it an excellent choice for deployment in agricultural robotics. Qualitative assessments further confirm the model's performance, demonstrating its capability to accurately infer the poses of ripe and partially ripe strawberries, while facing challenges in detecting unripe specimens. This suggests opportunities for future improvements, especially in enhancing detection capabilities for unripe strawberries (if desired) by exploring variations in color . Furthermore, the methodology presented could be adapted easily for other fruits such as apples, peaches, and plums, thereby expanding its applicability and impact in the field of agricultural automation.
- North America > United States (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- Asia > China (0.04)
- Food & Agriculture > Agriculture (0.46)
- Leisure & Entertainment (0.36)
- Information Technology (0.34)
PINGS: Physics-Informed Neural Network for Fast Generative Sampling
Prasha, Achmad Ardani, Rachmadi, Clavino Ourizqi, Syahlan, Muhamad Fauzan Ibnu, Anugerah, Naufal Rahfi, Raditya, Nanda Garin, Amelia, Putri, Mutiara, Sabrina Laila, Ramadhan, Hilman Syachr
We introduce PINGS (Physics-Informed Neural Network for Fast Generative Sampling), a framework that amortizes diffusion sampling by training a physics-informed network to approximate reverse-time probability-flow dynamics, reducing sampling to a single forward pass (NFE = 1). As a proof of concept, we learn a direct map from a 3D standard normal to a non-Gaussian Gaussian Mixture Model (GMM). PINGS preserves the target's distributional structure (multi-bandwidth kernel $MMD^2 = 1.88 \times 10^{-2}$ with small errors in mean, covariance, skewness, and excess kurtosis) and achieves constant-time generation: $10^4$ samples in $16.54 \pm 0.56$ millisecond on an RTX 3090, versus 468-843 millisecond for DPM-Solver (10/20) and 960 millisecond for DDIM (50) under matched conditions. We also sanity-check the PINN/automatic-differentiation pipeline on a damped harmonic oscillator, obtaining MSEs down to $\mathcal{O}(10^{-5})$. Compared to fast but iterative ODE solvers and direct-map families (Flow, Rectified-Flow, Consistency), PINGS frames generative sampling as a PINN-style residual problem with endpoint anchoring, yielding a white-box, differentiable map with NFE = 1. These proof-of-concept results position PINGS as a promising route to fast, function-based generative sampling with potential extensions to scientific simulation (e.g., fast calorimetry).
APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
Ma, Shaobo, Fang, Chao, Shao, Haikuo, Wang, Zhongfeng
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs, however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.
FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization
Wang, Aotao, Shao, Haikuo, Ma, Shaobo, Wang, Zhongfeng
State Space Models (SSMs), like recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices encounters many problems: severe outliers within the linear layer challenging the quantization, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated accelerator on FPGA with hardware-algorithm co-design to promote the deployment efficiency of Mamba2. Specifically, we successfully achieve 8-bit quantization for linear layers through Hadamard transformation to eliminate outliers. Moreover, a hardware-friendly and fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the nonlinear functions. Based on the accurate algorithm quantization, we propose an accelerator that integrates parallel vector processing units, pipelined execution dataflow, and an efficient SSM Nonlinear Approximation Unit, which enhances computational efficiency and reduces hardware complexity. Finally, we evaluate FastMamba on Xilinx VC709 FPGA. For the input prefill task on Mamba2-130M, FastMamba achieves 68.80\times and 8.90\times speedup over Intel Xeon 4210R CPU and NVIDIA RTX 3090 GPU, respectively. In the output decode experiment with Mamba2-2.7B, FastMamba attains 6\times higher energy efficiency than RTX 3090 GPU.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology (0.49)
- Semiconductors & Electronics (0.35)
- Information Technology > Hardware (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Comparative Analysis of FPGA and GPU Performance for Machine Learning-Based Track Reconstruction at LHCb
Giasemis, Fotis I., Lončar, Vladimir, Granado, Bertrand, Gligorov, Vladimir Vava
In high-energy physics, the increasing luminosity and detector granularity at the Large Hadron Collider are driving the need for more efficient data processing solutions. Machine Learning has emerged as a promising tool for reconstructing charged particle tracks, due to its potentially linear computational scaling with detector hits. The recent implementation of a graph neural network-based track reconstruction pipeline in the first level trigger of the LHCb experiment on GPUs serves as a platform for comparative studies between computational architectures in the context of high-energy physics. This paper presents a novel comparison of the throughput of ML model inference between FPGAs and GPUs, focusing on the first step of the track reconstruction pipeline$\unicode{x2013}$an implementation of a multilayer perceptron. Using HLS4ML for FPGA deployment, we benchmark its performance against the GPU implementation and demonstrate the potential of FPGAs for high-throughput, low-latency inference without the need for an expertise in FPGA development and while consuming significantly less power.
- Europe > France > Île-de-France > Paris > Paris (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Research Report (0.64)
- Workflow (0.49)
- Information Technology (0.50)
- Energy > Power Industry (0.46)
Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems
Zhu, Yu, Jiang, Wenqi, Alonso, Gustavo
Keeping ML-based recommender models up-to-date as data drifts and evolves is essential to maintain accuracy. As a result, online data preprocessing plays an increasingly important role in serving recommender systems. Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node. Such an approach results in high deployment costs and energy consumption. For instance, a recent report from industrial deployments shows that data storage and ingestion pipelines can account for over 60\% of the power consumption in a recommender system. In this paper, we tackle the issue from a hardware perspective by introducing Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion. As part of the design, we define MiniPipe, the smallest pipeline unit enabling multi-pipeline implementation by executing various data preprocessing tasks across the single board, giving Piper the ability to be reconfigured at runtime. Our results, using publicly released commercial pipelines, show that Piper, prototyped on a power-efficient FPGA, achieves a 39$\sim$105$\times$ speedup over a server-grade, 128-core CPU and 3$\sim$17$\times$ speedup over GPUs like RTX 3090 and A100 in multiple pipelines. The experimental analysis demonstrates that Piper provides advantages in both latency and energy efficiency for preprocessing tasks in recommender systems, providing an alternative design point for systems that today are in very high demand.
- Information Technology > Services (1.00)
- Energy > Oil & Gas > Midstream (0.34)
Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models
During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions.
$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
Khandelwal, Apoorv, Yun, Tian, Nayak, Nihal V., Merullo, Jack, Bach, Stephen H., Sun, Chen, Pavlick, Ellie
Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can't pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > Middle East (0.04)
- Asia > Middle East > Oman (0.04)
- Africa > Middle East (0.04)
- Questionnaire & Opinion Survey (0.93)
- Research Report > New Finding (0.48)
ProTrain: Efficient LLM Training via Memory-Aware Techniques
Yang, Hanmei, Zhou, Jin, Fu, Yao, Wang, Xiaoqun, Roane, Ramine, Guan, Hui, Liu, Tongping
It is extremely memory-hungry to train Large Language Models (LLM). To solve this problem, existing work exploits the combination of CPU and GPU for the training process, such as ZeRO-Offload. Such a technique largely democratizes billion-scale model training, making it possible to train with few consumer graphics cards. However, based on our observation, existing frameworks often provide coarse-grained memory management and require experienced experts in configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to the SOTA training systems.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > New York > New York County > New York City (0.04)