He, Zhezhi
CLLMs: Consistency Large Language Models
Kou, Siqi, Hu, Lanxiang, He, Zhezhi, Deng, Zhijie, Zhang, Hao
Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference, as they break the sequential nature of the LLM decoding process and transform it into parallelizable computation. In practice, however, Jacobi decoding achieves little speedup over traditional autoregressive (AR) decoding, primarily because it seldom accurately predicts more than one token in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state on a Jacobi trajectory to the fixed point. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4$\times$ to 3.4$\times$ improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks.
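As a rough illustration of the Jacobi fixed-point iteration discussed above, the sketch below refines an n-token block in parallel until it stops changing. It is a minimal Python/PyTorch sketch, not the CLLMs implementation; `model(ids)` returning per-position logits of shape [1, L, vocab] is an assumed interface.

    import torch

    @torch.no_grad()
    def jacobi_decode_block(model, prefix_ids, block_len, pad_id, max_iters=32):
        # Minimal Jacobi (fixed-point) decoding sketch for one n-token block.
        # `model(ids)` is assumed to return logits of shape [1, len, vocab];
        # this interface is illustrative, not the CLLMs codebase.
        guess = torch.full((1, block_len), pad_id, dtype=torch.long)  # trivial initial guess
        for _ in range(max_iters):
            ids = torch.cat([prefix_ids, guess], dim=1)
            logits = model(ids)                                       # [1, L, vocab]
            # Greedy prediction for every block position in parallel:
            # position i is predicted from prefix + current guess[:i].
            new_guess = logits[:, prefix_ids.size(1) - 1:-1, :].argmax(dim=-1)
            if torch.equal(new_guess, guess):                         # reached the fixed point
                return new_guess
            guess = new_guess
        return guess

The consistency training described in the abstract fine-tunes the model so that intermediate states on this trajectory already map to the fixed point, letting a loop like the one above exit after far fewer iterations.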
SpikeZIP-TF: Conversion is All You Need for Transformer-based SNN
You, Kang, Xu, Zekai, Nie, Chen, Deng, Zhijie, Guo, Qinghai, Wang, Xiang, He, Zhezhi
Spiking neural networks (SNNs) have attracted great attention due to their high efficiency and accuracy. Current ANN-to-SNN conversion methods can obtain SNNs with ANN-level accuracy at ultra-low latency (8 time-steps) for CNN architectures on computer vision (CV) tasks. However, while Transformer-based networks have achieved prevailing precision on both CV and natural language processing (NLP) tasks, Transformer-based SNNs still suffer lower accuracy than their ANN counterparts. In this work, we introduce a novel ANN-to-SNN conversion method called SpikeZIP-TF, in which the ANN and the converted SNN are exactly equivalent, thus incurring no accuracy degradation. SpikeZIP-TF achieves 83.82% accuracy on a CV dataset (ImageNet) and 93.79% accuracy on an NLP dataset (SST-2), which are higher than SOTA Transformer-based SNNs. The code is available on GitHub: https://github.com/Intelligent-Computing-Research-Group/SpikeZIP_transformer
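To make the conversion idea concrete, here is a minimal sketch of the generic equivalence between a quantized/clipped activation in the source ANN and a rate-coded integrate-and-fire neuron run for T time steps. It illustrates the conversion principle only; it is not the exact SpikeZIP-TF neuron or codebase.

    import numpy as np

    def quantized_relu(x, theta, T):
        # Quantized/clipped ReLU used by the source ANN (illustrative).
        return np.clip(np.floor(x * T / theta), 0, T) * theta / T

    def integrate_and_fire(x, theta, T):
        # Integrate-and-fire neuron driven by a constant input current `x`
        # for T time steps; returns the rate-coded output (spike count * theta / T).
        # This is the generic conversion equivalence, not the SpikeZIP-TF neuron.
        v, spikes = 0.0, 0
        for _ in range(T):
            v += x                 # accumulate membrane potential
            if v >= theta:         # fire, then reset by subtraction
                spikes += 1
                v -= theta
        return spikes * theta / T

    # The spike-rate output tracks the quantized activation:
    for x in [0.05, 0.3, 0.7, 1.2]:
        print(quantized_relu(x, theta=1.0, T=8), integrate_and_fire(x, theta=1.0, T=8))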
Model Extraction Attacks on Split Federated Learning
Li, Jingtao, Rakin, Adnan Siraj, Chen, Xing, Yang, Li, He, Zhezhi, Fan, Deliang, Chakrabarti, Chaitali
Federated Learning (FL) is a popular collaborative learning scheme involving multiple clients and a server. FL focuses on protecting clients' data but turns out to be highly vulnerable to Intellectual Property (IP) threats. Since FL periodically collects and distributes the model parameters, a free-rider can download the latest model and thus steal model IP. Split Federated Learning (SFL), a recent variant of FL that supports training with resource-constrained clients, splits the model into two parts, giving one part to the clients (client-side model) and the remaining part to the server (server-side model). SFL thus prevents model leakage by design. Moreover, by blocking prediction queries, it can be made resistant to advanced IP threats such as traditional Model Extraction (ME) attacks. While SFL provides better IP protection than FL, it is still vulnerable. In this paper, we expose the vulnerability of SFL and show how malicious clients can launch ME attacks by querying gradient information from the server side. We propose five variants of the ME attack, which differ in how the gradients are used as well as in their data assumptions. We show that, in practical settings, the proposed ME attacks work exceptionally well against SFL. For instance, when the server-side model has five layers, our proposed ME attack can achieve over 90% accuracy with less than 2% accuracy degradation with VGG-11 on CIFAR-10.
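The following is a schematic sketch of one gradient-based extraction step of the kind described above: the malicious client reuses the gradients that the SFL server returns for its smashed activations to train a surrogate server-side model by gradient matching. The `server_query` interface and the objective are illustrative assumptions, not the paper's exact attack variants.

    import torch
    import torch.nn as nn

    def gradient_matching_step(server_query, surrogate, opt, smashed, labels,
                               loss_fn=nn.CrossEntropyLoss()):
        # One schematic extraction step against SFL.
        # `server_query(smashed, labels)` is assumed to return the gradient of the
        # server's loss w.r.t. the smashed activations, as the SFL protocol does
        # during back-propagation; this interface is an assumption for illustration.
        target_grad = server_query(smashed, labels)            # grad from the real server
        smashed_sur = smashed.detach().requires_grad_(True)
        loss = loss_fn(surrogate(smashed_sur), labels)          # surrogate server-side model
        grad_sur, = torch.autograd.grad(loss, smashed_sur, create_graph=True)
        match = ((grad_sur - target_grad) ** 2).mean()          # match the returned gradients
        opt.zero_grad()
        match.backward()
        opt.step()
        return match.item()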
N3H-Core: Neuron-designed Neural Network Accelerator via FPGA-based Heterogeneous Computing Cores
Gong, Yu, Xu, Zhihan, He, Zhezhi, Zhang, Weifeng, Tu, Xiaobing, Liang, Xiaoyao, Jiang, Li
Accelerating neural network inference with FPGAs has emerged as a popular option, since the reconfigurability and high-performance computing capability of FPGAs intrinsically satisfy the computation demands of fast-evolving neural algorithms. However, popular neural accelerators on FPGAs (e.g., Xilinx DPU) mainly utilize DSP resources to construct their processing units, while the rich LUT resources are not well exploited. In this work, via a software-hardware co-design approach, we develop an FPGA-based heterogeneous computing system for neural network acceleration. From the hardware perspective, the proposed accelerator consists of DSP- and LUT-based GEneral Matrix-Multiplication (GEMM) computing cores, which together form the computing system in a heterogeneous fashion. The DSP- and LUT-based GEMM cores operate under a unified Instruction Set Architecture (ISA) and unified buffers. Along the data flow of the inference path, the computation of each convolution/fully-connected layer is split into two portions, handled by the DSP- and LUT-based GEMM cores asynchronously. From the software perspective, we mathematically and systematically model the latency and resource utilization of the proposed heterogeneous accelerator under varying system design configurations. By leveraging reinforcement learning, we construct a framework for end-to-end selection and optimization of the design specification of the target heterogeneous accelerator, including the workload split strategy, the mixed-precision quantization scheme, and the resource allocation between the DSP- and LUT-cores. By virtue of the proposed design framework and heterogeneous computing system, our design outperforms the state-of-the-art Mix&Match design, reducing latency by 1.12-1.32x while achieving higher inference accuracy. N3H-Core is open-sourced at: https://github.com/elliothe/N3H_Core.
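As a toy illustration of the workload split, the sketch below partitions one GEMM's output channels into a "DSP-core" portion and a "LUT-core" portion. In N3H-Core the two portions run asynchronously on different hardware (possibly at different precisions) and the split point comes from the RL-based search; here both halves run on the CPU and `dsp_ratio` is a hand-set stand-in.

    import numpy as np

    def split_gemm(x, w, dsp_ratio):
        # Illustrative output-channel split of one GEMM between two cores.
        # `dsp_ratio` stands in for the split point that the framework's
        # RL-based search would choose.
        n_out = w.shape[1]
        k = int(round(n_out * dsp_ratio))      # columns assigned to the "DSP core"
        y_dsp = x @ w[:, :k]                   # "DSP-core" portion
        y_lut = x @ w[:, k:]                   # "LUT-core" portion
        return np.concatenate([y_dsp, y_lut], axis=1)

    x = np.random.randn(4, 64)
    w = np.random.randn(64, 128)
    assert np.allclose(split_gemm(x, w, dsp_ratio=0.75), x @ w)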
MetaGater: Fast Learning of Conditional Channel Gated Networks via Federated Meta-Learning
Lin, Sen, Yang, Li, He, Zhezhi, Fan, Deliang, Zhang, Junshan
While deep learning has achieved phenomenal successes in many AI applications, its enormous model size and intensive computation requirements pose a formidable challenge to deployment on resource-limited nodes. There has recently been increasing interest in computationally efficient learning methods, e.g., quantization, pruning, and channel gating. However, most existing techniques cannot adapt to different tasks quickly. In this work, we advocate a holistic approach that jointly trains the backbone network and the channel gating, which enables dynamic selection of a subset of filters for more efficient local computation given the data input. In particular, we develop a federated meta-learning approach to jointly learn good meta-initializations for both the backbone networks and the gating modules, by making use of the model similarity across learning tasks on different nodes. In this way, the learnt meta-gating module effectively captures the important filters of a good meta-backbone network, based on which a task-specific conditional channel-gated network can be quickly adapted, i.e., through one-step gradient descent, from the meta-initializations in a two-stage procedure using new samples of that task. The convergence of the proposed federated meta-learning algorithm is established under mild conditions. Experimental results corroborate the effectiveness of our method in comparison to related work.
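A minimal sketch of the backbone-plus-gating decomposition and the one-step adaptation mentioned above is given below. The toy module, classification head, and learning rate are illustrative assumptions, not the paper's architecture or its two-stage federated procedure.

    import torch
    import torch.nn as nn

    class GatedConv(nn.Module):
        # Toy conv block with a per-channel gate, illustrating the
        # backbone + channel-gating decomposition only.
        def __init__(self, c_in, c_out):
            super().__init__()
            self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)    # backbone
            self.gate = nn.Parameter(torch.zeros(c_out))        # gating logits

        def forward(self, x):
            g = torch.sigmoid(self.gate).view(1, -1, 1, 1)      # soft channel gate
            return g * torch.relu(self.conv(x))

    def one_step_adapt(meta_model, x, y, lr=0.1):
        # One-step gradient adaptation of backbone and gate from the
        # meta-initialization (sketch); global-average-pool head is a toy choice.
        logits = meta_model(x).mean(dim=(2, 3))
        loss = nn.functional.cross_entropy(logits, y)
        grads = torch.autograd.grad(loss, list(meta_model.parameters()))
        return {name: p - lr * g
                for (name, p), g in zip(meta_model.named_parameters(), grads)}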
KSM: Fast Multiple Task Adaption via Kernel-wise Soft Mask Learning
Yang, Li, He, Zhezhi, Zhang, Junshan, Fan, Deliang
Deep Neural Networks (DNNs) can forget knowledge about earlier tasks when learning new tasks, a phenomenon known as \textit{catastrophic forgetting}. While recent continual learning methods are capable of alleviating catastrophic forgetting on toy-sized datasets, several issues remain when applying them to real-world problems. Recently, fast mask-based learning methods (e.g., Piggyback \cite{mallya2018piggyback}) have been proposed to address these issues by learning only a binary element-wise mask in a fast manner, while keeping the backbone model fixed. However, the binary mask has limited modeling capacity for new tasks. A more recent work \cite{hung2019compacting} proposes a compress-grow-based method (CPG) to achieve better accuracy for new tasks by partially training the backbone model, but at an order-of-magnitude higher training cost, which makes it infeasible for state-of-the-art edge/mobile learning. The primary goal of this work is to simultaneously achieve fast and high-accuracy multi-task adaptation in the continual learning setting. Thus motivated, we propose a new training method called \textit{kernel-wise Soft Mask} (KSM), which learns a kernel-wise hybrid binary and real-valued soft mask for each task while using the same backbone model. Such a soft mask can be viewed as the superposition of a binary mask and a properly scaled real-valued tensor, which offers richer representation capability without requiring low-level kernel support, thereby keeping hardware overhead low. We validate KSM on multiple benchmark datasets against recent state-of-the-art methods (e.g., Piggyback, PackNet, CPG), showing clear improvements in both accuracy and training cost.
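To illustrate the kernel-wise soft mask at a code level, the sketch below applies a per-kernel mask, formed as a thresholded binary component plus a scaled real-valued component, to a frozen convolution weight. The threshold, scale, and the omission of a straight-through estimator for the binary part are simplifications, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    class KernelSoftMask(nn.Module):
        # Sketch of a kernel-wise soft mask over a frozen 3x3 conv weight.
        def __init__(self, weight, threshold=0.5, scale=0.1):
            super().__init__()
            self.register_buffer("weight", weight)                   # frozen backbone kernels
            c_out, c_in = weight.shape[0], weight.shape[1]
            self.score = nn.Parameter(torch.ones(c_out, c_in, 1, 1)) # one score per 2D kernel
            self.threshold, self.scale = threshold, scale

        def masked_weight(self):
            binary = (self.score > self.threshold).float()           # binary component
            soft = binary + self.scale * self.score                  # + scaled real component
            # A straight-through estimator would be needed to train the binary
            # part; it is omitted here for brevity.
            return self.weight * soft

        def forward(self, x):
            return nn.functional.conv2d(x, self.masked_weight(), padding=1)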
T-BFA: Targeted Bit-Flip Adversarial Weight Attack
Rakin, Adnan Siraj, He, Zhezhi, Li, Jingtao, Yao, Fan, Chakrabarti, Chaitali, Fan, Deliang
Traditional Deep Neural Network (DNN) security is mostly related to the well-known adversarial input example attack. Recently, another dimension of adversarial attack, namely, attack on DNN weight parameters, has been shown to be very powerful. As a representative example, the Bit-Flip based adversarial weight Attack (BFA) injects an extremely small amount of fault into the weight parameters to hijack the DNN's function. Prior works on BFA focus on un-targeted attacks that misclassify all inputs into a random output class by flipping a very small number of weight bits stored in computer memory. This paper proposes the first targeted BFA-based (T-BFA) adversarial weight attack on DNN models, which can intentionally mislead selected inputs to a target output class. The objective is achieved by identifying the weight bits that are highly associated with the classification of a targeted output through a novel class-dependent weight bit ranking algorithm. T-BFA performance has been successfully demonstrated on multiple network architectures for the image classification task. For example, by merely flipping 27 out of 88 million weight bits, T-BFA can misclassify all the images from the 'Ibex' class into the 'Proboscis Monkey' class (i.e., a 100% attack success rate) on the ImageNet dataset, while maintaining 59.35% validation accuracy on ResNet-18. Moreover, we successfully demonstrate our T-BFA attack in a real computer prototype system running DNN computation.
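A first-order sketch of the class-dependent bit ranking idea is shown below: weights in one layer are scored by the gradient of a targeted loss that pushes a set of source-class inputs toward the attacker's target class. The real T-BFA operates on 8-bit quantized weights and refines this with a progressive bit search; the scoring here is only a crude proxy.

    import torch
    import torch.nn as nn

    def rank_flip_candidates(model, layer, x_src, target_class, topk=10):
        # Rank weights of one layer by how strongly a flip is expected to push
        # `x_src` toward `target_class` (first-order proxy; schematic only).
        target = torch.full((x_src.size(0),), target_class, dtype=torch.long)
        loss = nn.functional.cross_entropy(model(x_src), target)
        grad, = torch.autograd.grad(loss, layer.weight)
        # A weight whose gradient has large magnitude w.r.t. the targeted loss is
        # a promising flip candidate; T-BFA further accounts for the exact value
        # change each bit flip would cause in the quantized representation.
        score = grad.abs().flatten()
        return score.topk(topk).indices          # candidate weight positions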
Non-structured DNN Weight Pruning Considered Harmful
Wang, Yanzhi, Ye, Shaokai, He, Zhezhi, Ma, Xiaolong, Zhang, Linfeng, Lin, Sheng, Yuan, Geng, Tan, Sia Huat, Li, Zhengang, Fan, Deliang, Qian, Xuehai, Lin, Xue, Ma, Kaisheng
Large deep neural network (DNN) models pose a key challenge to energy efficiency, since off-chip DRAM accesses consume significantly more energy than arithmetic or SRAM operations. This motivates intensive research on model compression, with two main approaches. Weight pruning leverages the redundancy in the number of weights and can be performed in a non-structured manner, which offers higher flexibility and pruning rate but incurs index accesses due to irregular weights, or in a structured manner, which preserves the full matrix structure at a lower pruning rate. Weight quantization leverages the redundancy in the number of bits per weight. Compared to pruning, quantization is much more hardware-friendly and has become a "must-do" step for FPGA and ASIC implementations. This paper provides, for the first time, a definitive answer to the question of whether non-structured pruning remains desirable once quantization is accounted for. First, we build ADMM-NN-S by extending and enhancing ADMM-NN, a recently proposed joint weight pruning and quantization framework. Second, we develop a methodology for a fair and fundamental comparison of non-structured and structured pruning in terms of both storage and computation efficiency. Our results show that ADMM-NN-S consistently outperforms the prior art: (i) it achieves 348x, 36x, and 8x overall weight pruning on LeNet-5, AlexNet, and ResNet-50, respectively, with (almost) zero accuracy loss; (ii) we demonstrate for the first time that fully binarized (for all layers) DNNs can be lossless in accuracy in many cases. These results provide a strong baseline and credibility for our study. Based on the proposed comparison framework, with the same accuracy and quantization, the results show that non-structured pruning is not competitive in terms of either storage or computation efficiency. Thus, we conclude that non-structured pruning should be considered harmful, and we urge the community not to continue pursuing DNN inference acceleration for non-structured sparsity.
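A back-of-the-envelope version of the storage-side comparison reads as follows: element-wise (non-structured) sparsity must store an index per surviving weight, while structured (row/filter) pruning keeps dense sub-matrices and needs no indices. The bit widths and the value+index format below are illustrative assumptions, not the paper's exact cost model.

    import numpy as np

    def nonstructured_storage(w, keep_ratio, index_bits=16, value_bits=8):
        # Bits to store an element-wise sparse matrix in a simple
        # value+index format (format is illustrative).
        kept = int(w.size * keep_ratio)
        return kept * (value_bits + index_bits)

    def structured_storage(w, keep_ratio, value_bits=8):
        # Bits to store a row/filter-pruned matrix: surviving rows stay
        # dense, so no per-element indices are needed.
        kept_rows = int(w.shape[0] * keep_ratio)
        return kept_rows * w.shape[1] * value_bits

    w = np.zeros((512, 512))
    print(nonstructured_storage(w, 0.10))   # 10% of weights kept, each with an index
    print(structured_storage(w, 0.20))      # 20% of rows kept, dense and index-free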
Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation
He, Zhezhi, Fan, Deliang
In the past years, deep convolutional neural networks have achieved great success in many artificial intelligence applications. However, their enormous model size and massive computation cost have become the main obstacle to deploying such powerful algorithms in low-power, resource-limited mobile systems. As a countermeasure, deep neural networks with ternarized weights (i.e., -1, 0, +1) have been widely explored to greatly reduce the model size and computational cost with limited accuracy degradation. In this work, we propose a novel ternarized neural network training method that simultaneously optimizes both the weights and the quantizer during training, differentiating it from prior works. Instead of fixed and uniform weight ternarization, we are the first to incorporate the thresholds of weight ternarization into a closed-form representation using a truncated Gaussian approximation, enabling simultaneous optimization of weights and quantizer through back-propagation training. With both the first and last layers ternarized, experiments on the ImageNet classification task show that our ternarized ResNet-18/34/50 incurs only 3.9/2.52/2.16% accuracy degradation in comparison to the full-precision counterparts.
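The sketch below shows the generic shape of such a ternarizer: weights are mapped to {-alpha, 0, +alpha} using a threshold tied to the weight statistics, with a straight-through estimator in the backward pass. The fixed 0.7x mean-magnitude threshold is a stand-in; the paper instead derives the threshold and scale in closed form from a truncated Gaussian so they can be optimized jointly with the weights.

    import torch

    class Ternarize(torch.autograd.Function):
        # Illustrative ternarizer with a straight-through estimator;
        # not the paper's closed-form truncated-Gaussian formulation.
        @staticmethod
        def forward(ctx, w):
            delta = 0.7 * w.abs().mean()                 # stand-in threshold
            mask = (w.abs() > delta).float()
            alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
            return alpha * torch.sign(w) * mask          # weights in {-alpha, 0, +alpha}

        @staticmethod
        def backward(ctx, grad_out):
            return grad_out                              # straight-through estimator

Usage inside a quantized layer's forward pass would simply be `w_q = Ternarize.apply(w)`.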
Blind Pre-Processing: A Robust Defense Method Against Adversarial Examples
Rakin, Adnan Siraj, He, Zhezhi, Gong, Boqing, Fan, Deliang
Deep learning algorithms and networks are vulnerable to perturbed inputs, known as adversarial examples. Many defense methodologies have been investigated to defend against such adversarial attacks. In this work, we propose a novel methodology to defend against existing powerful attack models. For the first time, we introduce a new attacking scheme for the attacker and set a practical constraint for white-box attacks. Under this proposed attacking scheme, we present the best defense reported to date against some of the recent strong attacks. Our defense consists of a set of nonlinear functions that process the input data, making the model more robust to adversarial attacks; crucially, this processing layer is kept completely hidden from the attacker. Blind pre-processing improves the white-box attack accuracy on MNIST from 94.3\% to 98.7\%. Even as attack strength increases to the point where other defenses completely fail, blind pre-processing remains one of the strongest defenses reported. Another strength of our defense is that it eliminates the need for adversarial training, as it can significantly increase MNIST accuracy without adversarial training as well. Additionally, blind pre-processing can also increase the inference accuracy in the face of a powerful attack on the CIFAR-10 and SVHN datasets without sacrificing much clean-data accuracy.
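A generic example of a hidden nonlinear pre-processing stage is sketched below: a secret, randomly parameterized squashing followed by discretization applied to inputs before the classifier. It follows the abstract only in spirit; the specific nonlinearities, parameters, and secrecy mechanism are illustrative assumptions.

    import numpy as np

    def blind_preprocess(x, seed=1234, levels=8):
        # Hidden nonlinear pre-processing sketch: the seed stands in for
        # parameters kept secret from the attacker (illustrative only).
        rng = np.random.default_rng(seed)
        a, b = rng.uniform(0.8, 1.2), rng.uniform(-0.05, 0.05)
        x = np.tanh(a * (x + b))               # nonlinear squashing
        x = np.round(x * levels) / levels      # discretize to damp small perturbations
        return x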