SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers

Chekalina, Viktoriia, Rudenko, Anna, Mezentsev, Gleb, Mikhalev, Alexander, Panchenko, Alexander, Oseledets, Ivan

Oct-9-2024–arXiv.org Artificial Intelligence

The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1\% of the layer's elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp, robust popular state-of-the-art PEFT approaches.

fine-tuning, mlp block, sparsegrad, (15 more...)

arXiv.org Artificial Intelligence

Oct-9-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
  - Louisiana > Orleans Parish
    - New Orleans (0.04)
  - Alaska > Anchorage Municipality
    - Anchorage (0.04)
- Europe > Romania
  - Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Asia
  - Japan (0.06)
  - China (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.91)
  - Machine Learning > Neural Networks
    - Deep Learning (0.91)