Gu, Qingyi
CacheQuant: Comprehensively Accelerated Diffusion Models
Liu, Xuewen, Li, Zhikai, Gu, Qingyi
Diffusion models have gradually gained prominence in the field of image synthesis, showcasing remarkable generative capabilities. Nevertheless, the slow inference and complex networks, resulting from redundancy at both temporal and structural levels, hinder their low-latency applications in real-world scenarios. Current acceleration methods for diffusion models focus separately on temporal and structural levels. However, independent optimization at each level to further push the acceleration limits results in significant performance degradation. On the other hand, integrating optimizations at both levels can compound the acceleration effects. Unfortunately, we find that the optimizations at these two levels are not entirely orthogonal. Performing separate optimizations and then simply integrating them results in unsatisfactory performance. To tackle this issue, we propose CacheQuant, a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Specifically, we employ a dynamic programming approach to determine the optimal cache schedule, in which the properties of caching and quantization are carefully considered to minimize errors. Additionally, we propose decoupled error correction to further mitigate the coupled and accumulated errors step by step. Experimental results show that CacheQuant achieves a 5.18 speedup and 4 compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score. Our code are open-sourced: https://github.com/BienLuky/CacheQuant .
Efficient Event-based Semantic Segmentation with Spike-driven Lightweight Transformer-based Networks
Zhu, Xiaxin, Guo, Fangming, Long, Xianlei, Gu, Qingyi, Chen, Chao, Gu, Fuqiang
Event-based semantic segmentation has great potential in autonomous driving and robotics due to the advantages of event cameras, such as high dynamic range, low latency, and low power cost. Unfortunately, current artificial neural network (ANN)-based segmentation methods suffer from high computational demands, the requirements for image frames, and massive energy consumption, limiting their efficiency and application on resource-constrained edge/mobile platforms. To address these problems, we introduce SLTNet, a spike-driven lightweight transformer-based network designed for event-based semantic segmentation. Specifically, SLTNet is built on efficient spike-driven convolution blocks (SCBs) to extract rich semantic features while reducing the model's parameters. Then, to enhance the long-range contextural feature interaction, we propose novel spike-driven transformer blocks (STBs) with binary mask operations. Based on these basic blocks, SLTNet employs a high-efficiency single-branch architecture while maintaining the low energy consumption of the Spiking Neural Network (SNN). Finally, extensive experiments on DDD17 and DSEC-Semantic datasets demonstrate that SLTNet outperforms state-of-the-art (SOTA) SNN-based methods by at least 7.30% and 3.30% mIoU, respectively, with extremely 5.48x lower energy consumption and 1.14x faster inference speed.
TTAQ: Towards Stable Post-training Quantization in Continuous Domain Adaptation
Xiao, Junrui, Li, Zhikai, Yang, Lianwei, Mei, Yiduo, Gu, Qingyi
Post-training quantization (PTQ) reduces excessive hardware cost by quantizing full-precision models into lower bit representations on a tiny calibration set, without retraining. Despite the remarkable progress made through recent efforts, traditional PTQ methods typically encounter failure in dynamic and ever-changing real-world scenarios, involving unpredictable data streams and continual domain shifts, which poses greater challenges. In this paper, we propose a novel and stable quantization process for test-time adaptation (TTA), dubbed TTAQ, to address the performance degradation of traditional PTQ in dynamically evolving test domains. To tackle domain shifts in quantizer, TTAQ proposes the Perturbation Error Mitigation (PEM) and Perturbation Consistency Reconstruction (PCR). Specifically, PEM analyzes the error propagation and devises a weight regularization scheme to mitigate the impact of input perturbations. On the other hand, PCR introduces consistency learning to ensure that quantized models provide stable predictions for same sample. Furthermore, we introduce Adaptive Balanced Loss (ABL) to adjust the logits by taking advantage of the frequency and complexity of the class, which can effectively address the class imbalance caused by unpredictable data streams during optimization. Extensive experiments are conducted on multiple datasets with generic TTA methods, proving that TTAQ can outperform existing baselines and encouragingly improve the accuracy of low bit PTQ models in continually changing test domains. For instance, TTAQ decreases the mean error of 2-bit models on ImageNet-C dataset by an impressive 10.1\%.
DilateQuant: Accurate and Efficient Diffusion Quantization via Weight Dilation
Liu, Xuewen, Li, Zhikai, Gu, Qingyi
Diffusion models have shown excellent performance on various image generation tasks, but the substantial computational costs and huge memory footprint hinder their low-latency applications in real-world scenarios. Quantization is a promising way to compress and accelerate models. Nevertheless, due to the wide range and time-varying activations in diffusion models, existing methods cannot maintain both accuracy and efficiency simultaneously for low-bit quantization. To tackle this issue, we propose DilateQuant, a novel quantization framework for diffusion models that offers comparable accuracy and high efficiency. Specifically, we keenly aware of numerous unsaturated in-channel weights, which can be cleverly exploited to reduce the range of activations without additional computation cost. Based on this insight, we propose Weight Dilation (WD) that maximally dilates the unsaturated in-channel weights to a constrained range through a mathematically equivalent scaling. WD costlessly absorbs the activation quantization errors into weight quantization. The range of activations decreases, which makes activations quantization easy. The range of weights remains constant, which makes model easy to converge in training stage. Considering the temporal network leads to time-varying activations, we design a Temporal Parallel Quantizer (TPQ), which sets time-step quantization parameters and supports parallel quantization for different time steps, significantly improving the performance and reducing time cost. To further enhance performance while preserving efficiency, we introduce a Block-wise Knowledge Distillation (BKD) to align the quantized models with the full-precision models at a block level. The simultaneous training of time-step quantization parameters and weights minimizes the time required, and the shorter backpropagation paths decreases the memory footprint of the quantization process.
Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare
Li, Zhikai, Zhang, Jing, Gu, Qingyi
The disparity in healthcare personnel expertise and medical resources across different regions of the world is a pressing social issue. Artificial intelligence technology offers new opportunities to alleviate this issue. Segment Anything Model (SAM), which excels in intelligent image segmentation, has demonstrated exceptional performance in medical monitoring and assisted diagnosis. Unfortunately, the huge computational and storage overhead of SAM poses significant challenges for deployment on resource-limited edge devices. Quantization is an effective solution for model compression; however, traditional methods rely heavily on original data for calibration, which raises widespread concerns about medical data privacy and security. In this paper, we propose a data-free quantization framework for SAM, called DFQ-SAM, which learns and calibrates quantization parameters without any original data, thus effectively preserving data privacy during model compression. Specifically, we propose pseudo-positive label evolution for segmentation, combined with patch similarity, to fully leverage the semantic and distribution priors in pre-trained models, which facilitates high-quality data synthesis as a substitute for real data. Furthermore, we introduce scale reparameterization to ensure the accuracy of low-bit quantization. We perform extensive segmentation experiments on various datasets, and DFQ-SAM consistently provides significant performance on low-bit quantization. DFQ-SAM eliminates the need for data transfer in cloud-edge collaboration, thereby protecting sensitive data from potential attacks. It enables secure, fast, and personalized healthcare services at the edge, which enhances system efficiency and optimizes resource allocation, and thus facilitating the pervasive application of artificial intelligence in worldwide healthcare.
A Novel Wide-Area Multiobject Detection System with High-Probability Region Searching
Long, Xianlei, Zhao, Hui, Chen, Chao, Gu, Fuqiang, Gu, Qingyi
In recent years, wide-area visual surveillance systems have been widely applied in various industrial and transportation scenarios. These systems, however, face significant challenges when implementing multi-object detection due to conflicts arising from the need for high-resolution imaging, efficient object searching, and accurate localization. To address these challenges, this paper presents a hybrid system that incorporates a wide-angle camera, a high-speed search camera, and a galvano-mirror. In this system, the wide-angle camera offers panoramic images as prior information, which helps the search camera capture detailed images of the targeted objects. This integrated approach enhances the overall efficiency and effectiveness of wide-area visual detection systems. Specifically, in this study, we introduce a wide-angle camera-based method to generate a panoramic probability map (PPM) for estimating high-probability regions of target object presence. Then, we propose a probability searching module that uses the PPM-generated prior information to dynamically adjust the sampling range and refine target coordinates based on uncertainty variance computed by the object detector. Finally, the integration of PPM and the probability searching module yields an efficient hybrid vision system capable of achieving 120 fps multi-object search and detection. Extensive experiments are conducted to verify the system's effectiveness and robustness.
LLM Inference Unveiled: Survey and Roofline Model Insights
Yuan, Zhihang, Shang, Yuzhang, Zhou, Yang, Dong, Zhen, Zhou, Zhe, Xue, Chenhao, Wu, Bingzhe, Li, Zhikai, Gu, Qingyi, Lee, Yong Jae, Yan, Yan, Chen, Beidi, Sun, Guangyu, Keutzer, Kurt
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there hasn't been a concise framework that analyzes the various methods of LLM Inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model for systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. We systematically collate the latest advancements in efficient LLM inference, covering crucial areas such as model compression (e.g., Knowledge Distillation and Quantization), algorithm improvements (e.g., Early Exit and Mixture-of-Expert), and both hardware and system-level enhancements. Our survey stands out by analyzing these methods with roofline model, helping us understand their impact on memory access and computation. This distinctive approach not only showcases the current research landscape but also delivers valuable insights for practical implementation, positioning our work as an indispensable resource for researchers new to the field as well as for those seeking to deepen their understanding of efficient LLM deployment. The analyze tool, LLM-Viewer, is open-sourced.
A Novel Dynamic Light-Section 3D Reconstruction Method for Wide-Range Sensing
Chen, Mengjuan, Li, Qing, Shimasaki, Kohei, Hu, Shaopeng, Gu, Qingyi, Ishii, Idaku
Existing galvanometer-based laser scanning systems are challenging to apply in multi-scale 3D reconstruction because of the difficulty in achieving a balance between high reconstruction accuracy and a wide reconstruction range. This paper presents a novel method that synchronizes laser scanning by switching the field-of-view (FOV) of a camera using multi-galvanometers. In addition to the advanced hardware setup, we establish a comprehensive mathematical model of the system by modeling dynamic camera, dynamic laser, and their combined interaction. We then propose a high-precision and flexible calibration method by constructing an error model and minimizing the objective function. Finally, we evaluate the performance of the proposed system by scanning standard components. The evaluation results demonstrate that the accuracy of the proposed 3D reconstruction system achieves 0.3 mm when the measurement range is extended to 1100 mm $\times$ 1300 mm $\times$ 650 mm. With the same reconstruction accuracy, the reconstruction range is expanded by a factor of 25, indicating that the proposed method simultaneously allows for high-precision and wide-range 3D reconstruction in industrial applications.
RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization
Li, Zhikai, Liu, Xuewen, Zhang, Jing, Gu, Qingyi
Large transformer models have demonstrated remarkable success. Post-training quantization (PTQ), which requires only a small dataset for calibration and avoids end-to-end retraining, is a promising solution for compressing these large models. Regrettably, existing PTQ methods typically exhibit non-trivial performance loss. We find that the performance bottleneck stems from over-consideration of hardware compatibility in the quantization process, compelling them to reluctantly employ simple quantizers, albeit at the expense of accuracy. With the above insights, we propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm to address the above issues. RepQuant employs complex quantizers in the quantization process and simplified quantizers in the inference process, and performs mathematically equivalent transformations between the two through quantization scale reparameterization, thus ensuring both accurate quantization and efficient inference. More specifically, we focus on two components with extreme distributions: LayerNorm activations and Softmax activations. Initially, we apply channel-wise quantization and log$\sqrt{2}$ quantization, respectively, which are tailored to their distributions. In particular, for the former, we introduce a learnable per-channel dual clipping scheme, which is designed to efficiently identify outliers in the unbalanced activations with fine granularity. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference. Moreover, quantized weight reconstruction is seamlessly integrated into the above procedure to further push the performance limits. Extensive experiments are performed on different large-scale transformer variants on multiple tasks, including vision, language, and multi-modal transformers, and RepQuant encouragingly demonstrates significant performance advantages.
Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models
Liu, Xuewen, Li, Zhikai, Xiao, Junrui, Gu, Qingyi
Diffusion models have achieved great success in image generation tasks through iterative noise estimation. However, the heavy denoising process and complex neural networks hinder their low-latency applications in real-world scenarios. Quantization can effectively reduce model complexity, and post-training quantization (PTQ), which does not require fine-tuning, is highly promising in accelerating the denoising process. Unfortunately, we find that due to the highly dynamic distribution of activations in different denoising steps, existing PTQ methods for diffusion models suffer from distribution mismatch issues at both calibration sample level and reconstruction output level, which makes the performance far from satisfactory, especially in low-bit cases. In this paper, we propose Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models (EDA-DM) to address the above issues. Specifically, at the calibration sample level, we select calibration samples based on the density and diversity in the latent space, thus facilitating the alignment of their distribution with the overall samples; and at the reconstruction output level, we propose Fine-grained Block Reconstruction, which can align the outputs of the quantized model and the full-precision model at different network granularity. Extensive experiments demonstrate that EDA-DM outperforms the existing post-training quantization frameworks in both unconditional and conditional generation scenarios. At low-bit precision, the quantized models with our method even outperform the full-precision models on most datasets.