Training Neural Networks from Scratch with Parallel Low-Rank Adapters
Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal
– arXiv.org Artificial Intelligence
Although our method extends the number of training samples required for convergence by 40%, we can fit models that are 3× bigger with roughly half the bandwidth. Our work explores a new territory for low-rank adapters: the pre-training setting.

The focus of this work is the low-rank adapter (LoRA; Hu et al., 2022), a subclass of linear adapters; the linearity allows the learned update to be folded back into the original weights. LoRA is frequently used for finetuning transformers, often leaving less than 10% of the total parameters trainable (even as low as 0.5%). Although the forward pass incurs an extra computational overhead, the significance of the LoRA parameterization lies in the optimizer memory footprint: optimizers such as AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2019) typically maintain two states for each trainable parameter, so optimizer memory is roughly twice the size of the trainable parameters. QLoRA (Dettmers et al., 2023) achieves further memory savings by storing W in reduced precision. These works have catalyzed the development of several repositories (Wang, 2023; Dettmers et al., 2023; Dettmers, 2023; huggingface, 2023), enabling finetuning of models with billions of parameters on low-memory devices. For an in-depth discussion of related works, see Section 5.

ReLoRA (Lialin et al., 2023) sequentially trains and merges LoRA parameters; however, it does not yield comparable pre-training performance without an initial phase of full-parameter training. In contrast, our work uses parallel updates to match pre-training performance without any full-parameter training. FedLoRA focuses on the distributed finetuning of LoRA parameters. AdaMix (Wang et al., 2022) averages all MLPs in a Mixture of Experts (MoE) into a single MLP, but requires constant synchronization during the forward and backward passes, whereas our work requires no synchronization in the forward pass.

Unless stated otherwise, we denote x a scalar, x (bold) a vector, X (bold capital) a matrix, X (calligraphic) a distribution or a set, and f(·) a function.

Adapters serve as trainable functions that modify existing network layers; they facilitate parameter-efficient finetuning of large-scale models by minimizing the number of trainable parameters. Although low-rank adapters (LoRAs) have proven to be an effective finetuning method, they have apparent limitations when used for pre-training. As evidenced in Figure 2, models parameterized with LoRA demonstrate inferior performance compared to models trained with standard optimization. This performance gap is not surprising, as it can be attributed to the inherent rank constraint of LoRA. To understand the conditions required to pre-train a model with LoRA, we first identify a specific scenario involving standard LoRA training; this serves as a guide for developing our algorithm, which retains the memory efficiency of LoRA.
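To make the parameter and memory accounting above concrete, the following is a minimal, hypothetical PyTorch sketch of a LoRA-style linear layer: the dense weight W stays frozen and only the low-rank factors B and A are trained, so AdamW's two per-parameter states are kept only for the small factors. The class name, rank, and scaling constant are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a low-rank adapted linear layer: y = x W^T + scale * x A^T B^T."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen dense weight W: no gradients, no optimizer state.
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02, requires_grad=False)
        # Trainable low-rank factors; B starts at zero so the update BA is zero initially.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Two small matmuls avoid materializing the full d_out x d_in update.
        return x @ self.weight.T + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d_in=1024, d_out=1024, rank=8)
frozen = layer.weight.numel()
trainable = layer.A.numel() + layer.B.numel()
print(f"trainable fraction: {trainable / (frozen + trainable):.2%}")  # ~1.5% at rank 8
# AdamW keeps two moment buffers per *trainable* parameter, so optimizer memory is
# roughly 2 * trainable values here, instead of 2 * frozen values for full training.
```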
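The contrast drawn above with ReLoRA and AdaMix (sequential merging and constant synchronization versus parallel, synchronization-free updates) can be sketched as a toy example. This is a hedged illustration of the parallel low-rank update idea, not the paper's exact algorithm: the number of heads, merge period, averaging rule, and helper names are all assumptions made for this sketch.

```python
import torch

d, rank, num_heads = 256, 4, 4
W = torch.zeros(d, d)  # shared base weight, frozen between merges

# Each head owns its low-rank factors and its own optimizer (in practice, on its own device).
heads = [{"A": torch.randn(rank, d) * 0.01, "B": torch.zeros(d, rank)} for _ in range(num_heads)]
for h in heads:
    h["A"].requires_grad_(True)
    h["B"].requires_grad_(True)
opts = [torch.optim.AdamW([h["A"], h["B"]], lr=1e-3) for h in heads]

def local_step(h, opt, x, y):
    # Forward/backward is purely local to this head: W is read-only here,
    # so no cross-head communication is needed during the pass.
    pred = x @ (W + h["B"] @ h["A"]).T
    loss = torch.nn.functional.mse_loss(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

for merge_round in range(3):
    x, y = torch.randn(32, d), torch.randn(32, d)
    for h, opt in zip(heads, opts):
        for _ in range(10):          # independent local steps, no synchronization
            local_step(h, opt, x, y)
    with torch.no_grad():
        # Infrequent merge: average the low-rank updates into the shared weight,
        # then reset B so training continues from the merged weight.
        W += sum(h["B"] @ h["A"] for h in heads) / num_heads
        for h in heads:
            h["B"].zero_()
```

In this toy setup, communication happens only at the merge step, which is the property the excerpt emphasizes relative to methods that synchronize every forward and backward pass.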
Feb-26-2024