Zhao, Tianchen
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
Zhao, Tianchen, Ning, Xuefei, Fang, Tongcheng, Liu, Enshu, Huang, Guyue, Lin, Zinan, Yan, Shengen, Dai, Guohao, Wang, Yu
Diffusion models have achieved significant visual generation quality. However, their significant computational and memory costs pose challenges for their application on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce the inference time by reducing the number of denoising steps. However, their memory consumption is still excessive. Post-Training Quantization (PTQ) replaces high bit-width FP representations with low-bit integer values (INT4/8), which is an effective and efficient technique to reduce the memory cost. However, when applied to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment. To address this issue, we propose a mixed-precision quantization framework, MixDQ. First, we design a specialized BOS-aware quantization method for the highly sensitive text embedding. Then, we conduct metric-decoupled sensitivity analysis to measure the sensitivity of each layer. Finally, we develop an integer-programming-based method to conduct bit-width allocation. While existing quantization methods fall short at W8A8, MixDQ achieves W8A8 without performance loss, and W4A8 with negligible visual degradation. Compared with FP16, we achieve a 3-4x reduction in model size and memory cost, and a 1.45x latency speedup.
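The bit-width allocation step described in the abstract above can be illustrated with a toy example. This is only a minimal sketch, not the MixDQ formulation: the function name `allocate_bits`, the layer sizes, the sensitivity scores, and the budget are all hypothetical, and exhaustive search stands in for a real integer-programming solver.

```python
from itertools import product


def allocate_bits(sizes, sensitivity, budget_bits, choices=(4, 8)):
    """Pick one bit-width per layer by exhaustive search (toy stand-in for an IP solver).

    sizes[i]          : number of weights in layer i (hypothetical values)
    sensitivity[i][b] : estimated quality loss if layer i is stored with b bits
    budget_bits       : maximum total weight storage, in bits
    Returns (total_loss, assignment), or None if no assignment fits the budget.
    """
    best = None
    for assign in product(choices, repeat=len(sizes)):
        cost = sum(n * b for n, b in zip(sizes, assign))
        if cost > budget_bits:
            continue  # violates the memory constraint
        loss = sum(sensitivity[i][b] for i, b in enumerate(assign))
        if best is None or loss < best[0]:
            best = (loss, assign)
    return best


# Hypothetical three-layer model: layer 0 is small but very sensitive.
sizes = [1000, 4000, 2000]
sensitivity = [
    {4: 0.90, 8: 0.05},
    {4: 0.10, 8: 0.02},
    {4: 0.30, 8: 0.03},
]
loss, bits = allocate_bits(sizes, sensitivity, budget_bits=40000)
print(bits)
```

In this toy setting the large, insensitive layer is the one pushed to 4 bits, which is the intuition behind allocating precision by sensitivity rather than uniformly.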
Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start"
Zhao, Junbo, Ning, Xuefei, Liu, Enshu, Ru, Binxin, Zhou, Zixuan, Zhao, Tianchen, Chen, Chen, Zhang, Jiajin, Liao, Qingmin, Wang, Yu
Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe "cold-start" problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
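The second step above, an architecture-conditioned weighted sum over expert predictions, might look roughly like the following. This is a sketch under stated assumptions: the linear gate, the softmax normalization, and the names `gate_logits` and `ensemble_predict` are illustrative choices, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_logits(arch_encoding, gate_weights):
    """Hypothetical linear gate: one logit per expert from the architecture encoding."""
    return [sum(w * x for w, x in zip(row, arch_encoding)) for row in gate_weights]


def ensemble_predict(arch_encoding, expert_preds, gate_weights):
    """Combine expert predictions with architecture-dependent coefficients."""
    coeffs = softmax(gate_logits(arch_encoding, gate_weights))
    prediction = sum(c * p for c, p in zip(coeffs, expert_preds))
    return prediction, coeffs


# Two experts, a 3-dimensional architecture encoding, and an untrained (all-zero) gate.
arch = [1.0, 0.0, 2.0]
preds = [0.6, 0.8]
zero_gate = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
prediction, coeffs = ensemble_predict(arch, preds, zero_gate)
print(prediction, coeffs)
```

With an all-zero gate the coefficients are uniform, so the ensemble reduces to a plain average; training the gate on a small set of true architecture-performance pairs is what would make the weights depend on the input architecture.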
Multi-shot NAS for Discovering Adversarially Robust Convolutional Neural Architectures at Targeted Capacities
Ning, Xuefei, Zhao, Junbo, Li, Wenshuo, Zhao, Tianchen, Yang, Huazhong, Wang, Yu
Convolutional neural networks (CNNs) are vulnerable to adversarial examples, and studies show that increasing the model capacity of an architecture topology (e.g., width expansion) can bring consistent robustness improvements. This reveals a clear robustness-efficiency trade-off that should be considered in architecture design. Recent studies have employed one-shot neural architecture search (NAS) to discover adversarially robust architectures. However, since the capacities of different topologies cannot be easily aligned during the search process, current one-shot NAS methods might favor topologies with larger capacity in the supernet, and the discovered topology might be sub-optimal when aligned to the targeted capacity. This paper proposes a novel multi-shot NAS method to explicitly search for adversarially robust architectures at a certain targeted capacity. Specifically, we estimate the reward at the targeted capacity using interior extrapolation of the rewards from multiple supernets. Experimental results demonstrate the effectiveness of the proposed method. For instance, at the targeted FLOPs of 1560M, the discovered MSRobNet-1560 (clean 84.8%, PGD100 52.9%) outperforms the recent NAS-discovered architecture RobNet-free (clean 82.8%, PGD100 52.6%) with similar FLOPs. Code is available at https://github.com/walkerning/aw_nas.
BARS: Joint Search of Cell Topology and Layout for Accurate and Efficient Binary ARchitectures
Zhao, Tianchen, Ning, Xuefei, Yang, Songyi, Liang, Shuang, Lei, Peng, Chen, Jianfei, Yang, Huazhong, Wang, Yu
Binary Neural Networks (BNNs) have received significant attention due to their promising efficiency. Currently, most BNN studies directly adopt widely-used CNN architectures, which can be suboptimal for BNNs. This paper proposes a novel Binary ARchitecture Search (BARS) flow to discover superior binary architectures in a large design space. Specifically, we design a two-level (Macro & Micro) search space tailored for BNNs and apply differentiable neural architecture search (NAS) to explore this search space efficiently. The macro-level search space includes depth and width decisions, which are required for better balancing the model performance and capacity. We also modify the micro-level search space to strengthen the information flow for BNNs. A notable challenge of BNN architecture search is that binary operations exacerbate the "collapse" problem of differentiable NAS, and we incorporate various search and derive strategies to stabilize the search process. On CIFAR-10, BARS achieves 1.5% higher accuracy with 2/3 binary Ops and 1/10 floating-point Ops. On ImageNet, with similar resource consumption, the BARS-discovered architecture achieves a 3% accuracy gain over hand-crafted architectures, while removing the full-precision downsample layer.
Towards Lower Bit Multiplication for Convolutional Neural Network Training
Zhong, Kai, Zhao, Tianchen, Ning, Xuefei, Zeng, Shulin, Guo, Kaiyuan, Wang, Yu, Yang, Huazhong
Convolutional Neural Networks (CNNs) have been widely used in many fields. However, the training process consumes considerable energy and time, most of it in the convolution operations. In this paper, we propose a fixed-point training framework to reduce the data bit-width for the convolution multiplications. Firstly, we propose two constrained group-wise scaling methods that can be implemented with low hardware cost. Secondly, to overcome the challenge of trading off overflow and rounding error, a shiftable fixed-point data format is used in this framework. Finally, we propose a double-width deployment technique to boost inference performance with the same bit-width hardware multiplier. The experimental results show that the input data of convolution in the training process can be quantized to 2-bit for the CIFAR-10 dataset and 6-bit for the ImageNet dataset, with negligible accuracy degradation. Furthermore, our fixed-point training framework has the potential to save at least 75% of the computation energy in the training process.
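To make the overflow versus rounding-error trade-off in the abstract above concrete, here is a small sketch of quantizing one group of values to signed fixed-point with a shared power-of-two scale. It is an assumption-laden illustration, not the paper's scaling methods or data format: `quantize_group`, the per-group shift rule, and the sample values are all hypothetical.

```python
import math


def quantize_group(values, bits):
    """Quantize a group of floats to signed `bits`-bit integers sharing one shift.

    The scale is constrained to a power of two (2**shift), so rescaling could be
    done with a bit shift in hardware. The shift is the smallest exponent for
    which the largest magnitude in the group does not overflow.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 0
    shift = math.ceil(math.log2(peak / qmax))
    scale = 2.0 ** shift
    # Clamp as a safeguard against floating-point error in the shift computation.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, shift


def dequantize_group(q, shift):
    """Map integers back to floats using the shared power-of-two scale."""
    return [v * 2.0 ** shift for v in q]


q, shift = quantize_group([0.5, -1.0, 0.25], bits=4)
print(q, shift, dequantize_group(q, shift))
```

A larger shift avoids overflow for the group's largest value but coarsens the grid for the small ones; choosing the shift per group rather than per tensor is one way to soften that trade-off.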
Diversity-Sensitive Conditional Generative Adversarial Networks
Yang, Dingdong, Hong, Seunghoon, Jang, Yunseok, Zhao, Tianchen, Lee, Honglak
We propose a simple yet highly effective method that addresses the mode-collapse problem in the Conditional Generative Adversarial Network (cGAN). Although conditional distributions are multi-modal (i.e., having many modes) in practice, most cGAN approaches tend to learn an overly simplified distribution where an input is always mapped to a single output regardless of variations in the latent code. To address this issue, we propose to explicitly regularize the generator to produce diverse outputs depending on latent codes. The proposed regularization is simple, general, and can be easily integrated into most conditional GAN objectives. Additionally, explicit regularization on the generator allows our method to control the balance between visual quality and diversity. We demonstrate the effectiveness of our method on three conditional generation tasks: image-to-image translation, image inpainting, and future video prediction. We show that the simple addition of our regularization to existing models leads to surprisingly diverse generations, substantially outperforming the previous approaches for multi-modal conditional generation specifically designed for each individual task.
Information Theoretic Interpretation of Deep learning
Zhao, Tianchen, Sun, Yuekai