Dong, Peijie
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Zhang, Longteng, Liu, Xiang, Li, Zeyu, Pan, Xinglin, Dong, Peijie, Fan, Ruibo, Guo, Rui, Wang, Xin, Luo, Qiong, Shi, Shaohuai, Chu, Xiaowen
Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has led to numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs are expensive, as they require considerable computing resources and memory; hence, many efficient approaches have been developed to improve system pipelines as well as operators. However, runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes, i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B), on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help in understanding different optimization techniques, training and inference frameworks, and hardware platforms when choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses uncover potential opportunities for future work to further optimize the runtime performance of LLMs.
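As an illustration of the kind of micro-benchmarking described above, the sketch below times forward and backward passes of a single Transformer layer in PyTorch. The layer, tensor sizes, and the `benchmark_step` helper are illustrative stand-ins, not the paper's actual harness.

```python
# Minimal sketch of module-level throughput timing in PyTorch (illustrative, not
# the paper's benchmark harness). A small TransformerEncoderLayer stands in for
# an LLM block; on real hardware you would wrap a full model and its optimizer.
import time
import torch
import torch.nn as nn

def benchmark_step(module, batch, n_warmup=5, n_iters=20):
    """Return mean latency (s) and tokens/s for forward+backward of one module."""
    device = batch.device
    for _ in range(n_warmup):                     # warm up kernels and allocator
        module(batch).sum().backward()
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        module(batch).sum().backward()
    if device.type == "cuda":
        torch.cuda.synchronize()                  # wait for asynchronous CUDA kernels
    latency = (time.perf_counter() - start) / n_iters
    tokens = batch.shape[0] * batch.shape[1]
    return latency, tokens / latency

if __name__ == "__main__":
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).to(dev)
    x = torch.randn(8, 512, 1024, device=dev)     # (batch, sequence, hidden)
    lat, tps = benchmark_step(layer, x)
    print(f"latency {lat * 1e3:.1f} ms, throughput {tps:,.0f} tokens/s")
```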
EMQ: Evolving Training-free Proxies for Automated Mixed Precision Quantization
Dong, Peijie, Li, Lujun, Wei, Zimian, Niu, Xin, Tian, Zhiliang, Pan, Hengyue
Mixed-Precision Quantization (MQ) can achieve a competitive accuracy-complexity trade-off for models. Conventional training-based search methods require time-consuming candidate training to search for optimized per-layer bit-width configurations in MQ. Recently, some training-free approaches have presented various MQ proxies and significantly improved search efficiency. However, the correlation between these proxies and quantization accuracy is poorly understood. To address this gap, we first build MQ-Bench-101, which covers different bit configurations and their quantization results. We then observe that the existing training-free proxies correlate only weakly with quantization accuracy on MQ-Bench-101. To efficiently seek superior proxies, we develop an automatic proxy-search framework for MQ via evolutionary algorithms. In particular, we devise an elaborate search space involving the existing proxies and perform an evolutionary search to discover the best-correlated MQ proxy. We propose a diversity-prompting selection strategy and a compatibility screening protocol to avoid premature convergence and improve search efficiency. In this way, our Evolving proxies for Mixed-precision Quantization (EMQ) framework allows the automatic generation of proxies without heavy tuning or expert knowledge. Extensive experiments on ImageNet with various ResNet and MobileNet families demonstrate that EMQ outperforms state-of-the-art mixed-precision methods at a significantly reduced cost. The code will be released.
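The following toy sketch illustrates the general idea of evolving proxies scored by their rank correlation with recorded quantization accuracies. The primitive set, the synthetic benchmark, and helpers such as `evolve` are assumptions for illustration, not the EMQ implementation or MQ-Bench-101 itself.

```python
# Toy sketch of evolutionary proxy search in the spirit of EMQ (illustrative only).
# A candidate proxy picks one weight statistic per layer; each candidate is scored
# by Kendall's tau between its ranking of bit-width configurations and the recorded
# quantization accuracy (a synthetic stand-in for MQ-Bench-101 entries).
import random
import numpy as np
from scipy.stats import kendalltau

PRIMITIVES = {
    "l1":  lambda w: np.abs(w).mean(),
    "l2":  lambda w: np.square(w).mean(),
    "var": lambda w: w.var(),
    "max": lambda w: np.abs(w).max(),
}

def score_config(proxy_ops, weights, bits):
    """Proxy score of one bit-width configuration: per-layer statistic scaled by its bit-width."""
    return sum(PRIMITIVES[op](w) * b for op, w, b in zip(proxy_ops, weights, bits))

def fitness(proxy_ops, benchmark):
    """Rank correlation between proxy scores and measured accuracies over the benchmark."""
    scores = [score_config(proxy_ops, weights, bits) for weights, bits, _ in benchmark]
    accs = [acc for _, _, acc in benchmark]
    tau, _ = kendalltau(scores, accs)
    return 0.0 if np.isnan(tau) else tau

def evolve(benchmark, n_layers, pop_size=20, generations=30, mutate_p=0.3):
    ops = list(PRIMITIVES)
    pop = [[random.choice(ops) for _ in range(n_layers)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, benchmark), reverse=True)
        survivors = pop[: pop_size // 2]                      # keep the best half
        children = [[random.choice(ops) if random.random() < mutate_p else g
                     for g in parent] for parent in survivors]
        pop = survivors + children                            # mutation-only reproduction
    return max(pop, key=lambda ind: fitness(ind, benchmark))

if __name__ == "__main__":
    # Synthetic benchmark: 3 layers, random weights, random bit configs, fake accuracies.
    rng = np.random.default_rng(0)
    bench = [([rng.standard_normal(64) for _ in range(3)],
              rng.integers(2, 9, size=3).tolist(),
              float(rng.uniform(60, 75))) for _ in range(32)]
    best = evolve(bench, n_layers=3)
    print("best proxy per layer:", best, "tau:", round(fitness(best, bench), 3))
```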
DisWOT: Student Architecture Search for Distillation WithOut Training
Dong, Peijie, Li, Lujun, Wei, Zimian
Knowledge distillation (KD) is an effective training strategy for improving lightweight student models under the guidance of cumbersome teachers. However, large architectural differences across teacher-student pairs limit the distillation gains. In contrast to previous adaptive distillation methods that reduce the teacher-student gap, we explore a novel training-free framework to search for the best student architecture for a given teacher. We first show empirically that the optimal model under vanilla training is not necessarily the winner in distillation. Second, we find that the similarity of feature semantics and sample relations between randomly initialized teacher-student networks correlates well with final distillation performance. Thus, we efficiently measure similarity matrices conditioned on the semantic activation maps to select the optimal student via an evolutionary algorithm without any training. In this way, our student architecture search for Distillation WithOut Training (DisWOT) significantly improves the performance of the model in the distillation stage with at least 180$\times$ training acceleration. Additionally, we extend the similarity metrics in DisWOT as new distillers and KD-based zero proxies. Our experiments on CIFAR, ImageNet, and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces. Our project and code are available at https://lilujunai.github.io/DisWOT-CVPR2023/.
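A minimal sketch of a training-free, DisWOT-style score follows, assuming sample-relation (Gram) matrices computed over one forward pass of randomly initialized networks. The `diswot_like_score` helper and the toy backbones are illustrative, not the authors' code.

```python
# Illustrative sketch of a DisWOT-style training-free score: compare the
# sample-relation (Gram) matrices of teacher and student activations on one
# forward pass at random initialization. Higher similarity -> better candidate.
import torch
import torch.nn.functional as F

def relation_matrix(feat):
    """Pairwise sample similarity from a (batch, C, H, W) feature map."""
    flat = F.normalize(feat.flatten(1), dim=1)   # (batch, C*H*W), unit-normalized
    return flat @ flat.t()                       # (batch, batch) Gram matrix

@torch.no_grad()
def diswot_like_score(teacher, student, batch):
    """Negative Frobenius distance between teacher and student relation matrices."""
    rt = relation_matrix(teacher(batch))
    rs = relation_matrix(student(batch))
    return -torch.norm(rt - rs, p="fro").item()

if __name__ == "__main__":
    # Toy conv nets stand in for real teacher/student backbones.
    teacher = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.ReLU())
    candidates = [torch.nn.Sequential(torch.nn.Conv2d(3, c, 3, padding=1), torch.nn.ReLU())
                  for c in (16, 32, 48)]
    x = torch.randn(8, 3, 32, 32)
    scores = [diswot_like_score(teacher, s, x) for s in candidates]
    print("best candidate index:", int(torch.tensor(scores).argmax()))
```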
Multi-trial Neural Architecture Search with Lottery Tickets
Wei, Zimian, Pan, Hengyue, Li, Lujun, Lu, Menglong, Niu, Xin, Dong, Peijie, Li, Dongsheng
Neural architecture search (NAS) has brought significant progress to recent image recognition tasks. Most existing NAS methods apply restricted search spaces, which limits the upper-bound performance of the searched models. To address this issue, we propose a new search space named MobileNet3-MT. By reducing human prior knowledge across all dimensions of the networks, MobileNet3-MT accommodates more potential candidates. To search in this challenging space, we present an efficient Multi-trial Evolution-based NAS method termed MENAS. Specifically, we accelerate the evolutionary search process by gradually pruning models in the population. Each model is trained with early stopping and replaced by its lottery ticket (the explored optimal pruned network). In this way, the full training pipeline of cumbersome networks is avoided and more efficient networks are automatically generated. Extensive experimental results on ImageNet-1K, CIFAR-10, and CIFAR-100 demonstrate that MENAS achieves state-of-the-art performance.
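The lottery-ticket replacement step could look roughly like the sketch below: global magnitude pruning of an early-stopped model, with the surviving weights rewound to their initial values. The `lottery_ticket` helper and the toy network are assumptions for illustration, not the MENAS code.

```python
# Sketch of a lottery-ticket style replacement step (illustrative): prune the
# smallest-magnitude weights of an early-stopped model and rewind the surviving
# weights to their initial values, yielding a cheaper candidate for the population.
import copy
import torch
import torch.nn as nn

def lottery_ticket(model, init_state, prune_ratio=0.5):
    """Return a pruned copy of `model` with kept weights rewound to `init_state`."""
    ticket = copy.deepcopy(model)
    ticket.load_state_dict(init_state)                    # rewind to initialization
    # Global magnitude threshold over all weight tensors of the trained model.
    all_w = torch.cat([p.detach().abs().flatten()
                       for n, p in model.named_parameters() if "weight" in n])
    threshold = torch.quantile(all_w, prune_ratio)
    with torch.no_grad():
        for (n, trained), (_, rewound) in zip(model.named_parameters(),
                                              ticket.named_parameters()):
            if "weight" in n:
                mask = trained.detach().abs() > threshold
                rewound.mul_(mask.to(rewound.dtype))      # zero out pruned connections
    return ticket

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    init_state = copy.deepcopy(net.state_dict())          # saved before training
    # ... early-stopped training of `net` would happen here ...
    pruned = lottery_ticket(net, init_state, prune_ratio=0.5)
    kept = sum((p != 0).sum().item() for p in pruned.parameters())
    print("non-zero parameters in ticket:", kept)
```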
DMFormer: Closing the Gap Between CNN and Vision Transformers
Wei, Zimian, Pan, Hengyue, Li, Lujun, Lu, Menglong, Niu, Xin, Dong, Peijie, Li, Dongsheng
Vision transformers have shown excellent performance in computer vision tasks. Since the computation cost of their self-attention mechanism is high, recent works have tried to replace the self-attention mechanism in vision transformers with convolutional operations, which are more efficient and carry a built-in inductive bias. However, these efforts either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a Dynamic Multi-level Attention mechanism (DMA), which captures different patterns of the input images with multiple kernel sizes and enables input-adaptive weights via a gating mechanism. Based on DMA, we present an efficient backbone network named DMFormer. DMFormer adopts the overall architecture of vision transformers while replacing the self-attention mechanism with our proposed DMA. Extensive experimental results on the ImageNet-1K and ADE20K datasets demonstrate that DMFormer achieves state-of-the-art performance, outperforming similar-sized vision transformers (ViTs) and convolutional neural networks (CNNs).
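One plausible reading of the DMA block is sketched below, assuming parallel depthwise convolutions with different kernel sizes fused by input-conditioned gating weights. The `DynamicMultiLevelAttention` class is an interpretation of the abstract, not the authors' implementation.

```python
# Interpretation of a DMA-style block (not the authors' code): parallel depthwise
# convolutions with different kernel sizes capture multi-level patterns, and a
# gating branch produces input-adaptive weights to fuse them.
import torch
import torch.nn as nn

class DynamicMultiLevelAttention(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # Gating: global context -> one weight per branch, per sample.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(kernel_sizes), 1),
            nn.Softmax(dim=1),
        )
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, H, W)
        weights = self.gate(x).unsqueeze(2)                        # (B, K, 1, 1, 1)
        fused = (feats * weights).sum(dim=1)                       # weighted sum over branches
        return self.proj(fused) + x                                # residual connection

if __name__ == "__main__":
    dma = DynamicMultiLevelAttention(channels=64)
    out = dma(torch.randn(2, 64, 56, 56))
    print(out.shape)   # torch.Size([2, 64, 56, 56])
```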