Unifying Synergies between Self-supervised Learning and Dynamic Computation
Tarun Krishna, Ayush K. Rai, Alexandru Drimbarean, Eric Arazo, Paul Albert, Alan F. Smeaton, Kevin McGuinness, Noel E. O'Connor
–arXiv.org Artificial Intelligence
Self-supervised representation learning methods [4, 7, 11, 12, 14] are the standard approach for training large-scale deep neural networks (DNNs). A key reason for their popularity is their ability to leverage the inherent structure of a vast unlabeled corpus during pre-training, which makes them highly suitable for transfer learning [28]. However, this comes at the cost of substantially larger model sizes, computationally expensive training strategies (longer training times, large batch sizes, etc.) [13, 28], and consequently more expensive inference. Although such strategies are effective for achieving state-of-the-art results in computer vision, they may be impractical in resource-constrained industrial settings that require lightweight models deployed on edge devices. To lessen the computational burden, it is common to extract (or learn) a lightweight network from an off-the-shelf pre-trained model. This has been achieved through techniques such as knowledge distillation (KD) [35], pruning [24], and dynamic computation (DC) [58]. KD methods follow a standard two-step procedure of pre-training and then distilling knowledge into a student network, using either a self-supervised (SS) objective [1, 21, 51] or a combination of supervised and SS objectives [54]. Pruning-based approaches rely on multiple rounds of pre-train, prune, and fine-tune to obtain a lightweight network, irrespective of the objective. Methods based on dynamic/conditional computation [34, 58] likewise start from a pre-trained model and obtain a lightweight network while keeping the network topology intact via a gating mechanism. These approaches are effective, but fine-tuning a sub-network out of large pre-trained models (such as Large Language Models) can be computationally expensive and cumbersome.
Moreover, because downstream tasks vary widely, any change in the task requires repeating the entire procedure, making these approaches inefficient and poorly transferable.
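To make the gating idea behind dynamic/conditional computation concrete, the following is a minimal NumPy sketch (not the paper's actual method): a residual block whose transform is executed only when a small gating head fires, so per-input compute varies while the network topology stays intact. All names (`gate`, `gated_block`, the threshold of 0.5) are illustrative assumptions.

```python
import numpy as np

def gate(x, w_g, threshold=0.5):
    """Tiny gating head: sigmoid of a linear projection of the input.
    Returns 1.0 (execute the block) or 0.0 (skip it) per sample."""
    score = 1.0 / (1.0 + np.exp(-(x @ w_g)))      # shape: (batch,)
    return (score > threshold).astype(x.dtype)

def gated_block(x, w, w_g):
    """Residual block gated per input: the 'expensive' transform is
    masked out for samples whose gate does not fire, but the skip
    connection (and hence the topology) is always preserved."""
    g = gate(x, w_g)                              # (batch,)
    out = np.maximum(x @ w, 0.0)                  # ReLU(x W), the costly path
    return x + g[:, None] * out                   # identity path always kept

# Toy usage: a batch of 4 samples with 8 features.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 8)) * 0.1
w_g = rng.normal(size=(8,))
y = gated_block(x, w, w_g)
```

In a trained system the hard threshold would be replaced by a differentiable relaxation (e.g. Gumbel-softmax) so the gates can be learned jointly with the backbone; the sketch only illustrates the inference-time behaviour.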
Sep-9-2023