Peng, Hongwu
AutoReP: Automatic ReLU Replacement for Fast Private Network Inference
Peng, Hongwu, Huang, Shaoyi, Zhou, Tong, Luo, Yukui, Wang, Chenghong, Wang, Zigeng, Zhao, Jiahui, Xie, Xi, Li, Ang, Geng, Tony, Mahmood, Kaleel, Wen, Wujie, Xu, Xiaolin, Ding, Caiwen
The growth of the Machine-Learning-As-A-Service (MLaaS) market has highlighted clients' data privacy and security issues. Private inference (PI) techniques using cryptographic primitives offer a solution but often incur high computation and communication costs, particularly for non-linear operators such as ReLU. Many attempts to reduce ReLU operations exist, but they may require heuristic threshold selection or cause substantial accuracy loss. This work introduces AutoReP, a gradient-based approach that reduces the number of non-linear operators and alleviates these issues. It automates the selection between ReLU and polynomial functions to speed up PI applications, and introduces a distribution-aware polynomial approximation (DaPa) to maintain model expressivity while accurately approximating ReLUs. Our experimental results demonstrate significant accuracy improvements of 6.12% (94.31%, 12.9K ReLU budget, CIFAR-10), 8.39% (74.92%, 12.9K ReLU budget, CIFAR-100), and 9.45% (63.69%, 55K ReLU budget, Tiny-ImageNet) over current state-of-the-art methods such as SNL. Moreover, applying AutoReP to EfficientNet-B2 on the ImageNet dataset achieves 75.55% accuracy with a 176.1 times reduction in ReLU budget.
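The distribution-aware idea can be sketched numerically: fit a low-degree polynomial to ReLU under an assumed activation distribution, so the approximation is tight where activations actually concentrate. The snippet below is a hypothetical minimal sketch, not the paper's implementation; the quadratic degree, the N(0, 1) assumption, and the name `fit_dapa_poly` are illustrative.

```python
import numpy as np

def fit_dapa_poly(mu=0.0, sigma=1.0, degree=2, n_samples=100_000, seed=0):
    """Fit a polynomial to ReLU in a distribution-aware way: inputs are
    sampled from an assumed activation distribution N(mu, sigma^2), so the
    least-squares fit is tightest where activations actually concentrate."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n_samples)   # samples from the activation distribution
    y = np.maximum(x, 0.0)                 # ReLU targets
    return np.polyfit(x, y, degree)        # least-squares coefficients, highest degree first

coeffs = fit_dapa_poly()
grid = np.linspace(-2.0, 2.0, 401)
max_err = np.max(np.abs(np.polyval(coeffs, grid) - np.maximum(grid, 0.0)))
```

The resulting quadratic tracks ReLU closely on the bulk of the assumed distribution; the largest error sits near zero, where ReLU's kink cannot be matched by a smooth polynomial.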
Dynamic Sparse Training via Balancing the Exploration-Exploitation Trade-off
Huang, Shaoyi, Lei, Bowen, Xu, Dongkuan, Peng, Hongwu, Sun, Yue, Xie, Mimi, Ding, Caiwen
Over-parameterization of deep neural networks (DNNs) has yielded high prediction accuracy for many applications. Although effective, the large number of parameters hinders their deployment on resource-limited devices and has an outsize environmental impact. Sparse training (using a fixed number of nonzero weights in each iteration) can significantly mitigate training costs by reducing the model size. However, existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies, resulting in local minima and low accuracy. In this work, we cast dynamic sparse training as a sparse connectivity search problem and design an exploitation-and-exploration acquisition function to escape from local optima and saddle points. We further provide theoretical guarantees for the proposed acquisition function and clarify its convergence property. Experimental results show that sparse models (up to 98% sparsity) obtained by our method outperform state-of-the-art (SOTA) sparse training methods on a wide variety of deep learning tasks. On VGG-19 / CIFAR-100, ResNet-50 / CIFAR-10, and ResNet-50 / CIFAR-100, our method achieves even higher accuracy than dense models. On ResNet-50 / ImageNet, it yields up to 8.2% accuracy improvement over SOTA sparse training methods.
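A drop-and-grow update of this kind can be sketched as follows. This is a hypothetical minimal version, not the paper's acquisition function: it drops the smallest-magnitude active weights (exploitation of magnitude), then regrows connections mostly where gradients are largest (exploitation), with a random fraction (exploration) to help escape local optima and saddle points.

```python
import numpy as np

def drop_and_grow(w, grad, explore_frac=0.2, rng=None):
    """One drop-and-grow update on a weight matrix `w` with gradient `grad`.
    The total number of active (nonzero-masked) weights is preserved."""
    rng = rng or np.random.default_rng(0)
    mask = (w != 0)
    n_update = max(1, int(mask.sum()) // 10)    # churn 10% of connections per step

    # Drop the n_update smallest-magnitude nonzero weights.
    active = np.flatnonzero(mask)
    drop = active[np.argsort(np.abs(w.flat[active]))[:n_update]]
    mask.flat[drop] = False
    w.flat[drop] = 0.0

    # Regrow: gradient-guided picks plus a random exploration fraction.
    inactive = np.flatnonzero(~mask)
    n_explore = int(n_update * explore_frac)
    by_grad = inactive[np.argsort(-np.abs(grad.flat[inactive]))[:n_update - n_explore]]
    rest = np.setdiff1d(inactive, by_grad)
    grow = np.concatenate([by_grad, rng.choice(rest, size=n_explore, replace=False)])
    mask.flat[grow] = True                      # regrown weights start at zero
    return w, mask
```

Regrown weights start at zero so the update only changes connectivity, not the current function; their values are learned in subsequent iterations.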
Towards Sparsification of Graph Neural Networks
Peng, Hongwu, Gurevin, Deniz, Huang, Shaoyi, Geng, Tong, Jiang, Weiwen, Khan, Omer, Ding, Caiwen
As real-world graphs expand in size, larger GNN models with billions of parameters are being deployed. The high parameter count in such models makes training and inference on graphs expensive and challenging. To reduce the computational and memory costs of GNNs, optimization methods such as pruning redundant nodes and edges in input graphs have been commonly adopted. However, model compression, which directly targets the sparsification of model layers, has been mostly limited to traditional deep neural networks (DNNs) used for tasks such as image classification and object detection. In this paper, we utilize two state-of-the-art model compression methods, (1) train and prune and (2) sparse training, to sparsify the weight layers of GNNs. We evaluate and compare both methods in terms of accuracy, training sparsity, and training FLOPs on real-world graphs. Our experimental results show that on the ia-email, wiki-talk, and stackoverflow datasets for link prediction, sparse training with much lower training FLOPs achieves accuracy comparable to the train-and-prune method. On the brain dataset for node classification, sparse training uses fewer FLOPs (less than 1/7 the FLOPs of train and prune) and preserves much better accuracy under extreme model sparsity.
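The train-and-prune baseline reduces, at its core, to one-shot magnitude pruning of a trained weight layer. A generic sketch (an illustration, not the exact method evaluated in the paper):

```python
import numpy as np

def magnitude_prune(weight, sparsity):
    """One-shot magnitude pruning: zero out the `sparsity` fraction of
    smallest-magnitude entries of a trained weight matrix. Note that this
    scheme requires training the full dense model first; sparse training
    instead keeps the layer sparse for the whole run, which is where its
    training-FLOP savings come from."""
    k = int(round(weight.size * sparsity))
    if k == 0:
        return weight.copy()
    thresh = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weight) <= thresh, 0.0, weight)
```

With ties in magnitude this simple threshold rule may prune slightly more than the requested fraction; production implementations break ties explicitly.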
RRNet: Towards ReLU-Reduced Neural Network for Two-party Computation Based Private Inference
Peng, Hongwu, Zhou, Shanglin, Luo, Yukui, Xu, Nuo, Duan, Shijin, Ran, Ran, Zhao, Jiahui, Huang, Shaoyi, Xie, Xi, Wang, Chenghong, Geng, Tong, Wen, Wujie, Xu, Xiaolin, Ding, Caiwen
Machine-Learning-as-a-Service (MLaaS) has emerged as a popular solution for accelerating inference in various applications [1]-[11]. The challenges of MLaaS are several-fold: inference latency and privacy. To accelerate MLaaS training and inference applications, accelerated gradient sparsification [12], [13] and model compression methods [14]-[22] have been proposed. On the other side, a major limitation of MLaaS is the requirement for clients to reveal raw input data to the service provider, which may compromise the privacy of users. This issue has been highlighted in previous studies such as [23]. In this work, we address this challenge by proposing a novel approach for privacy-preserving MLaaS: the ReLU-Reduced Neural Architecture Search (RRNet) framework, which jointly optimizes the structure of the deep neural network (DNN) model and the hardware architecture to support high-performance MPC-based PI. Our framework eliminates the need for manual heuristic analysis by automating the process of exploring the design space and identifying the optimal configuration of DNN models and hardware architectures for 2PC-based PI. We use an FPGA accelerator design as a demonstration.
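The joint search can be illustrated with a toy design-space sweep (all numbers below are hypothetical, for illustration only): each layer's activation is either ReLU, which is accurate but expensive under 2PC, or a polynomial, which is cheap under 2PC but costs some accuracy, and the search picks the most accurate assignment that fits a latency budget.

```python
import itertools

# Hypothetical per-layer costs: ReLU is accurate but slow under 2PC,
# a polynomial activation is fast under 2PC but loses some accuracy.
LATENCY_MS = {'relu': 10.0, 'poly': 1.0}
ACC_DROP_PCT = {'relu': 0.0, 'poly': 0.4}

def best_config(n_layers, latency_budget_ms):
    """Exhaustively score every per-layer activation assignment and return
    the most accurate one that fits the 2PC latency budget."""
    best = None
    for cfg in itertools.product(['relu', 'poly'], repeat=n_layers):
        lat = sum(LATENCY_MS[a] for a in cfg)
        acc = 100.0 - sum(ACC_DROP_PCT[a] for a in cfg)
        if lat <= latency_budget_ms and (best is None or acc > best[1]):
            best = (cfg, acc, lat)
    return best

cfg, acc, lat = best_config(n_layers=4, latency_budget_ms=25.0)
```

Exhaustive enumeration is only viable for toy sizes; a practical NAS framework searches this space with gradient-based or learned strategies rather than brute force.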
PolyMPCNet: Towards ReLU-free Neural Architecture Search in Two-party Computation Based Private Inference
Peng, Hongwu, Zhou, Shanglin, Luo, Yukui, Duan, Shijin, Xu, Nuo, Ran, Ran, Huang, Shaoyi, Wang, Chenghong, Geng, Tong, Li, Ang, Wen, Wujie, Xu, Xiaolin, Ding, Caiwen
The rapid growth and deployment of deep learning (DL) has raised emerging privacy and security concerns. To mitigate these issues, secure multi-party computation (MPC) has been explored to enable privacy-preserving DL computation. In practice, however, MPC protocols often come with very high computation and communication overhead, which can prohibit their adoption in large-scale systems. Two orthogonal research trends have attracted enormous interest in addressing the energy efficiency of secure deep learning: overhead reduction of the MPC comparison protocol, and hardware acceleration. However, existing works either achieve a low reduction ratio and suffer from high latency due to limited computation and communication savings, or are power-hungry, as they mainly target general computing platforms such as CPUs and GPUs. In this work, as a first attempt, we develop PolyMPCNet, a systematic framework for joint overhead reduction of the MPC comparison protocol and hardware acceleration, which integrates the hardware latency of the cryptographic building blocks into the DNN loss function to achieve high energy efficiency, accuracy, and security guarantees. Instead of heuristically checking model sensitivity after a DNN is well trained (by deleting or dropping some non-polynomial operators), our key design principle is to enforce exactly what is assumed in the DNN design -- training a DNN that is both hardware efficient and secure, while escaping local minima and saddle points and maintaining high accuracy. More specifically, we propose a straight-through polynomial activation initialization method for a cryptographic-hardware-friendly trainable polynomial activation function to replace the expensive 2P-ReLU operator. We also develop a cryptographic hardware scheduler and a corresponding performance model for the Field Programmable Gate Array (FPGA) platform.
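The straight-through idea can be sketched as follows (a hypothetical minimal version, not the paper's exact formulation, and the class name and coefficients are illustrative): the forward pass evaluates a trainable quadratic, which is MPC-friendly because it needs only multiplications and additions rather than the expensive comparison protocol, while the backward pass routes gradients straight through ReLU's derivative.

```python
import numpy as np

class STPolyAct:
    """Trainable quadratic activation a*x^2 + b*x + c. The forward pass
    avoids the 2P-ReLU comparison protocol; the backward pass is
    straight-through, using ReLU's derivative instead of the quadratic's."""

    def __init__(self, a=0.1, b=0.5, c=0.2):
        # Initial coefficients chosen to roughly mimic ReLU near zero.
        self.a, self.b, self.c = a, b, c

    def forward(self, x):
        self.x = x
        return self.a * x**2 + self.b * x + self.c

    def backward(self, grad_out):
        # Straight-through estimator: backpropagate as if the layer were ReLU.
        return grad_out * (self.x > 0)
```

In a real framework the coefficients a, b, c would also receive gradients and be trained jointly with the network weights; this sketch shows only the straight-through input gradient.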
Aerial Manipulation Using a Novel Unmanned Aerial Vehicle Cyber-Physical System
Ding, Caiwu, Peng, Hongwu, Lu, Lu, Ding, Caiwen
Unmanned Aerial Vehicles (UAVs) are attaining ever greater maneuverability and sensing ability, making them a promising teleoperation platform for intelligent interaction with the environment. This work presents a novel 5-degree-of-freedom (DoF) UAV cyber-physical system for aerial manipulation. The UAV's body can exert a powerful propulsion force in the longitudinal direction, decoupling the translational dynamics from the rotational dynamics on the longitudinal plane. A high-level impedance control law is proposed to drive the vehicle for trajectory tracking and interaction with the environment. In addition, a vision-based real-time target identification and tracking method, which integrates a YOLO v3 real-time object detector with feature tracking and morphological operations, is proposed for onboard implementation, supported by model compression techniques, to eliminate the latency caused by wireless video transmission and the heavy computation burden on traditional teleoperation platforms.
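An impedance control law of the kind described can be sketched for a single DoF (generic gains and a unit mass, not the paper's vehicle model): the controller commands a force as a virtual spring-damper pulling the vehicle state toward the desired trajectory.

```python
def impedance_force(x, v, x_d, v_d, K=20.0, D=8.0):
    """Impedance control law for one DoF: command a force as a virtual
    spring-damper between the vehicle state (x, v) and the desired
    trajectory (x_d, v_d). K and D set the virtual stiffness and damping."""
    return K * (x_d - x) + D * (v_d - v)

# Simulate a unit point mass tracking a fixed 1 m setpoint with forward Euler.
x, v, dt = 0.0, 0.0, 0.01
for _ in range(2000):           # 20 s of simulated time
    f = impedance_force(x, v, x_d=1.0, v_d=0.0)
    v += f * dt                 # a = F/m with m = 1 kg
    x += v * dt
```

The virtual stiffness and damping make the closed loop compliant on contact: when the manipulator pushes against a surface, the commanded force stays bounded by K times the position error instead of growing without limit.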