Run LoRA Run: Faster and Lighter LoRA Implementations
Daria Cherniuk, Aleksandr Mikhalev, Ivan Oseledets
arXiv.org Artificial Intelligence
The LoRA paper of Hu et al. [2022] introduced low-rank adapters for fine-tuning large language models (LLMs) on downstream tasks. The approach quickly became popular due to the reduced cost of the weight update. Several modifications of LoRA followed: for example, QLoRA Dettmers et al. [2023] utilizes quantization to further reduce fine-tuning costs, and ReLoRA Lialin et al. [2023] showed that low-rank updates can also be used for full training. However, all variations of LoRA use the same chain of operations when computing the output, which often leads to a sub-optimal computation graph. We propose RunLora: a framework that contains different variants of the forward and backward pass through an adapter-induced linear layer and chooses the best pair for a given architecture. We evaluated the framework's performance on a series of Llama models and achieved up to 17% speedup solely from an optimized chain of PyTorch operations. Additionally, we saved up to 4 GB of memory by reducing the number of saved activations.
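The idea can be illustrated with a minimal PyTorch sketch (not the authors' implementation; all shapes and function names below are hypothetical). The LoRA output y = x(W + BA)^T admits several operation orderings, and the cheapest one depends on the layer dimensions and the adapter rank; a framework in the spirit of RunLora would benchmark such forward (and matching backward) variants and pick the fastest pair for the given architecture.

```python
import torch

def lora_forward_v1(x, W, A, B):
    # Variant 1: materialize the full update W + B @ A (shape out x in) first.
    # Simple, but costly in time and memory when the layer is large.
    return x @ (W + B @ A).T

def lora_forward_v2(x, W, A, B):
    # Variant 2: keep the update factored and route x through the rank-r
    # bottleneck. Typically cheaper when r << min(in_features, out_features).
    return x @ W.T + (x @ A.T) @ B.T

# Hypothetical shapes: 2048 tokens, hidden size 4096, LoRA rank 16.
x = torch.randn(2048, 4096)
W = torch.randn(4096, 4096)        # frozen pretrained weight
A = torch.randn(16, 4096)          # trainable LoRA factor (r x in)
B = 0.01 * torch.randn(4096, 16)   # trainable LoRA factor (out x r)

# Both variants compute the same result up to floating-point rounding.
y1 = lora_forward_v1(x, W, A, B)
y2 = lora_forward_v2(x, W, A, B)
assert torch.allclose(y1, y2, rtol=1e-3, atol=1e-3)
```

The same reasoning applies to the backward pass: which intermediate activations are saved and in what order the gradient matmuls are performed changes both runtime and memory, which is where the reported speedup and activation-memory savings come from.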
Dec-6-2023