Run LoRA Run: Faster and Lighter LoRA Implementations

Cherniuk, Daria, Mikhalev, Aleksandr, Oseledets, Ivan

arXiv.org Artificial Intelligence 

The LoRA paper Hu et al. [2022] introduced low-rank adapters for fine-tuning large language models (LLMs) on downstream tasks. This approach quickly became popular due to the reduced cost of the weight update. Several modifications of LoRA followed: for example, QLoRA Dettmers et al. [2023] adds quantization to further reduce fine-tuning costs, and ReLoRA Lialin et al. [2023] showed that low-rank updates can also be used for full pre-training. However, all variations of LoRA use the same chain of operations to compute the output, which often leads to a sub-optimal computation graph. We propose RunLoRA: a framework that contains different variants of the forward and backward pass through an adapter-induced linear layer and chooses the best pair for a given architecture. We evaluated our framework's performance on a series of Llama models and achieved up to 17% speedup purely from an optimized chain of PyTorch operations. Additionally, we saved up to 4 GB of memory by reducing the number of saved activations.
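To illustrate the kind of choice the abstract refers to, the sketch below shows a LoRA-augmented linear layer in PyTorch with two mathematically equivalent orderings of the adapter matmuls; which one is cheaper depends on batch size, hidden dimension, and rank. This is a minimal illustrative example, not the RunLoRA implementation, and the class and parameter names (LoRALinear, rank, alpha) are assumptions.

```python
# Minimal sketch of a LoRA-augmented linear layer (illustrative only;
# not the RunLoRA implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W (out_features x in_features)
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # Trainable low-rank factors A (in_features x rank) and B (rank x out_features)
        self.A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Variant 1: (x @ A) @ B -- two skinny matmuls, cheap when rank is small.
        lora_out = (x @ self.A) @ self.B
        # Variant 2: x @ (A @ B) -- materializes a full in_features x out_features
        # matrix first; the cheaper ordering depends on shapes.
        # lora_out = x @ (self.A @ self.B)
        return x @ self.weight.T + self.scale * lora_out

x = torch.randn(4, 128, 1024)          # (batch, seq, hidden)
layer = LoRALinear(1024, 1024, rank=8)
print(layer(x).shape)                  # torch.Size([4, 128, 1024])
```

Analogous orderings exist for the backward pass (e.g., which intermediate activations to save versus recompute), which is where the memory savings reported in the abstract come from.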