Samragh, Mohammad
SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models
Kim, Han-Byul, Hoang, Duc, Kundu, Arnav, Samragh, Mohammad, Cho, Minsik
With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. We therefore introduce a novel optimization technique, Sync-Point Drop (SPD), which reduces communication overhead in tensor parallelism by selectively dropping synchronization on attention outputs. Specifically, we first propose a block design that allows execution to proceed through SPD without communication. Second, we apply different SPD strategies to attention blocks based on their sensitivity to model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD reduced overall inference latency by about 20% with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
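A minimal single-process sketch of the idea, with tensor parallelism simulated by explicit weight shards; the shapes, rank count, and use of plain NumPy are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ranks = 8, 2

# Megatron-style sharding: a column-parallel attention projection followed by
# a row-parallel output projection, split across two simulated TP ranks.
w_in  = [rng.standard_normal((d, d // n_ranks)) for _ in range(n_ranks)]
w_out = [rng.standard_normal((d // n_ranks, d)) for _ in range(n_ranks)]

x = rng.standard_normal((1, d))
partials = [(x @ w_in[r]) @ w_out[r] for r in range(n_ranks)]

# Baseline tensor parallelism: an all-reduce (sum) is the sync point that
# combines the ranks' partial attention outputs before the next block.
attn_out_synced = sum(partials)

# Sync-Point Drop (sketch): for blocks that tolerate it, skip the all-reduce;
# each rank continues with only its local partial output, so no communication
# happens at this point.
attn_out_local = partials  # one rank-local tensor per rank, no sync
```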
Weight subcloning: direct initialization of transformers using larger pretrained ones
Samragh, Mohammad, Farajtabar, Mehrdad, Mehta, Sachin, Vemulapalli, Raviteja, Faghri, Fartash, Naik, Devang, Tuzel, Oncel, Rastegari, Mohammad
Training large transformer models from scratch for a target task requires large amounts of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with the weights of a pretrained model of the same size and specification to speed up convergence and training. But what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique for transferring the knowledge of a pretrained model to smaller variants. Our approach, called weight subcloning, expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning operates directly on the pretrained model to obtain an equivalently initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model; then, we remove blocks from the transformer to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which trains significantly faster than with random initialization. For instance, we achieve 4x faster training for vision transformers on image classification and for language models designed for next-token prediction.
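A minimal PyTorch sketch of the two subcloning steps; the L1-norm importance score, the uniform layer-selection rule, and all dimensions are illustrative assumptions:

```python
import torch

def neuron_importance(weight: torch.Tensor) -> torch.Tensor:
    # Assumed importance score: L1 norm of each output neuron's weights.
    return weight.abs().sum(dim=1)

def subclone_linear(weight: torch.Tensor, keep_out, keep_in) -> torch.Tensor:
    # Step 1: shrink the embedding dimension by keeping only the rows and
    # columns that correspond to the most important neurons.
    return weight[keep_out][:, keep_in].clone()

pretrained = torch.randn(768, 768)  # one layer of a larger pretrained model
keep = neuron_importance(pretrained).topk(512).indices.sort().values
small = subclone_linear(pretrained, keep, keep)  # 512x512 initialization

# Step 2: reduce depth by keeping an evenly spaced subset of the pretrained
# blocks (here, 24 source layers -> 12 target layers).
n_src, n_dst = 24, 12
kept_layers = [round(i * (n_src - 1) / (n_dst - 1)) for i in range(n_dst)]
```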
CLEANN: Accelerated Trojan Shield for Embedded Neural Networks
Javaheripi, Mojan, Samragh, Mohammad, Fields, Gregory, Javidi, Tara, Koushanfar, Farinaz
We propose CLEANN, the first end-to-end framework that enables online mitigation of Trojans for embedded Deep Neural Network (DNN) applications. A Trojan attack works by injecting a backdoor into the DNN during training; at inference time, the Trojan can be activated by the specific backdoor trigger. What differentiates CLEANN from prior work is its lightweight methodology, which recovers the ground-truth class of Trojan samples without the need for labeled data, model retraining, or prior assumptions about the trigger or the attack. We leverage dictionary learning and sparse approximation to characterize the statistical behavior of benign data and identify Trojan triggers. CLEANN is devised based on algorithm/hardware co-design and is equipped with specialized hardware to enable efficient real-time execution on resource-constrained embedded platforms. Proof-of-concept evaluations of CLEANN against state-of-the-art neural Trojan attacks on visual benchmarks demonstrate its competitive advantage in terms of attack resiliency and execution overhead.
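A minimal sketch of the dictionary-learning/sparse-approximation idea using scikit-learn; the toy data, the reconstruction-error score, and the 99th-percentile threshold are assumptions standing in for the paper's patch- and feature-level analysis:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
benign = rng.standard_normal((500, 64))  # stand-in for flattened benign patches

# Learn a sparse dictionary that characterizes the statistics of benign data.
dl = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=0)
codes = dl.fit_transform(benign)
errors = np.linalg.norm(benign - codes @ dl.components_, axis=1)
threshold = np.quantile(errors, 0.99)

def contains_trigger(x: np.ndarray) -> bool:
    # A sample whose sparse approximation error exceeds the benign threshold
    # deviates from benign statistics and is flagged as a potential trigger.
    code = dl.transform(x[None])
    return float(np.linalg.norm(x - code @ dl.components_)) > threshold
```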
RAPIDNN: In-Memory Deep Neural Network Acceleration Framework
Imani, Mohsen, Samragh, Mohammad, Kim, Yeseong, Gupta, Saransh, Koushanfar, Farinaz, Rosing, Tajana
Deep neural networks (DNNs) have demonstrated effectiveness in applications such as image processing, video segmentation, and speech recognition. Running state-of-the-art DNNs on current systems mostly relies on general-purpose processors, ASIC designs, or FPGA accelerators, all of which suffer from costly data movement due to limited on-chip memory and data-transfer bandwidth. In this work, we propose a novel framework, called RAPIDNN, which processes all DNN operations within memory to minimize the cost of data movement. To enable in-memory processing, RAPIDNN reinterprets a DNN model and maps it onto a specialized accelerator built from non-volatile memory blocks that implement the four fundamental DNN operations: multiplication, addition, activation functions, and pooling. The framework extracts representative operands of a DNN model, e.g., weights and input values, using clustering methods to optimize the model for in-memory processing. It then maps the extracted operands and their precomputed results into the accelerator's memory blocks. At runtime, the accelerator retrieves computation results via an efficient in-memory search capability, which also provides tunable approximation to further improve computation efficiency. Our evaluation shows that RAPIDNN achieves 68.4x and 49.5x energy-efficiency improvements and 48.1x and 10.9x speedups compared to ISAAC and PipeLayer, respectively, two state-of-the-art DNN accelerators, while ensuring less than 0.3% quality loss.
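A minimal software sketch of the cluster-and-lookup idea; the cluster counts and the scalar-multiplication example are assumptions, and the real framework builds such tables for all four operations inside non-volatile memory:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.standard_normal((1000, 1))
inputs  = rng.standard_normal((1000, 1))

# Cluster each operand stream down to a small set of representative values.
kw = KMeans(n_clusters=16, n_init=10, random_state=0).fit(weights)
ki = KMeans(n_clusters=16, n_init=10, random_state=0).fit(inputs)

# Precompute all centroid-pair products; this table is what the accelerator
# stores in its memory blocks instead of using a multiplier circuit.
table = kw.cluster_centers_ @ ki.cluster_centers_.T  # 16 x 16

def approx_multiply(w: float, x: float) -> float:
    # At runtime, a multiplication becomes two nearest-centroid searches plus
    # one table lookup, which map to in-memory search on the hardware.
    wi = kw.predict([[w]])[0]
    xi = ki.predict([[x]])[0]
    return table[wi, xi]
```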
CodeX: Bit-Flexible Encoding for Streaming-based FPGA Acceleration of DNNs
Samragh, Mohammad, Javaheripi, Mojan, Koushanfar, Farinaz
This paper proposes CodeX, an end-to-end framework that facilitates encoding, bitwidth customization, fine-tuning, and implementation of neural networks on FPGA platforms. CodeX incorporates nonlinear encoding into the computation flow of neural networks to save memory. The encoded features demand significantly less storage than the raw full-precision activation values; therefore, the execution flow of the CodeX hardware engine is performed entirely within the FPGA using on-chip streaming buffers, with no access to off-chip DRAM. We further propose a fully automated algorithm, inspired by reinforcement learning, that determines the customized encoding bitwidth for each network layer. The CodeX full-stack framework comprises a compiler that takes a high-level Python description of an arbitrary neural network architecture and instantiates the corresponding elements from the CodeX hardware library for FPGA implementation. Proof-of-concept evaluations on the MNIST, SVHN, and CIFAR-10 datasets demonstrate an average 4.65x throughput improvement over stand-alone weight encoding. We further compare CodeX with six existing full-precision DNN accelerators on ImageNet, showing average improvements of 3.6x in throughput and 2.54x in performance-per-watt.
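A minimal sketch of nonlinear activation encoding; placing the codebook levels at activation quantiles and the 3-bit example are assumptions (the paper learns the encoding and picks per-layer bitwidths with an RL-inspired search):

```python
import numpy as np

def fit_codebook(activations: np.ndarray, bits: int) -> np.ndarray:
    # Nonlinear codebook: put the 2^bits levels at activation quantiles so
    # dense regions of the distribution get finer resolution.
    return np.quantile(activations, np.linspace(0.0, 1.0, 2 ** bits))

def encode(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    # Store only a (bits)-wide index per activation instead of a 32-bit float.
    return np.abs(x[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)

def decode(idx: np.ndarray, levels: np.ndarray) -> np.ndarray:
    return levels[idx]

acts = np.random.default_rng(0).standard_normal(10_000).astype(np.float32)
levels = fit_codebook(acts, bits=3)  # per-layer bitwidth, here 3 bits
idx = encode(acts, levels)           # 3-bit codes in place of 32-bit floats
print("reconstruction mse:", float(np.mean((acts - decode(idx, levels)) ** 2)))
```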
CuRTAIL: ChaRacterizing and Thwarting AdversarIal deep Learning
Rouhani, Bita Darvish, Samragh, Mohammad, Javidi, Tara, Koushanfar, Farinaz
This paper proposes CuRTAIL, an end-to-end computing framework for characterizing and thwarting the adversarial space in the context of Deep Learning (DL). The framework protects deep neural networks against adversarial samples: perturbed inputs carefully crafted by malicious entities to mislead the underlying DL model. The precursor to the proposed methodology is a set of new quantitative metrics for assessing the vulnerability of various deep learning architectures to adversarial samples. CuRTAIL formalizes the goal of preventing adversarial samples as minimizing the space unexplored by the pertinent DL model, which is characterized in CuRTAIL's vulnerability-analysis step. To thwart adversarial machine learning attacks, CuRTAIL introduces Modular Robust Redundancy (MRR) as a viable solution to achieve the formalized minimization objective. The MRR methodology explicitly characterizes the geometry of the input data and the DL model parameters. It then learns a set of complementary but disjoint models that maximally cover the unexplored subspaces of the target DL model, thus reducing the risk of integrity attacks. We extensively evaluate CuRTAIL against state-of-the-art attacks including Fast Gradient Sign, Jacobian Saliency Map Attack, DeepFool, and Carlini & Wagner (L2). Proof-of-concept implementations on data collections including MNIST, CIFAR-10, and ImageNet corroborate CuRTAIL's effectiveness at detecting adversarial samples in different settings. Because the computations in each MRR module can be performed independently, the CuRTAIL detection algorithm can be fully parallelized across multiple hardware units to achieve maximum throughput. We further provide an accompanying API to facilitate adoption of the proposed framework in various applications.
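A minimal sketch of the MRR detection idea; the random projections and the per-module z-score threshold are stand-ins (assumptions) for the paper's learned complementary modules:

```python
import numpy as np

class MRRModule:
    # One redundancy module: models benign data in its own projection and
    # flags samples that land far outside the learned benign region.
    def __init__(self, dim: int, seed: int):
        self.proj = np.random.default_rng(seed).standard_normal((dim, dim))

    def fit(self, benign: np.ndarray) -> "MRRModule":
        z = benign @ self.proj
        self.mu, self.sigma = z.mean(axis=0), z.std(axis=0) + 1e-8
        return self

    def score(self, x: np.ndarray) -> float:
        return float(np.abs((x @ self.proj - self.mu) / self.sigma).max())

benign = np.random.default_rng(0).standard_normal((1000, 16))
modules = [MRRModule(16, seed).fit(benign) for seed in range(3)]

def is_adversarial(x: np.ndarray, thresh: float = 4.0) -> bool:
    # Modules are independent, so their checks can run fully in parallel in
    # deployment; any module rejecting the sample flags it as adversarial.
    return any(m.score(x) > thresh for m in modules)
```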