HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization
Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, Jin Lu
arXiv.org Artificial Intelligence
Fine-tuning large language models (LLMs) poses significant memory challenges, as the back-propagation process demands extensive resources, especially with growing model sizes. Recent work, MeZO, addresses this issue with a zeroth-order (ZO) optimization method that reduces memory consumption to the level of the inference phase. However, ZO fine-tuning converges slowly. To overcome this limitation, we introduce HELENE, a novel scalable and memory-efficient optimizer that integrates annealed A-GNB gradients with a diagonal Hessian estimation and layer-wise clipping, serving as a second-order pre-conditioner. This combination allows for faster and more stable convergence. Our theoretical analysis demonstrates that HELENE improves convergence rates, particularly for models with heterogeneous layer dimensions, by reducing the dependency on the total parameter space dimension. Furthermore, HELENE remains compatible with both full-parameter tuning and parameter-efficient fine-tuning (PEFT), outperforming several state-of-the-art optimizers. The code will be released after review.

LLMs have demonstrated remarkable capabilities across various downstream tasks. Fine-tuning these models has become the standard approach for improving task-specific performance, for which first-order optimizers such as Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951), Adam (Kingma & Ba, 2014), and AdamW (Loshchilov & Hutter, 2017) are widely used.
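To make the abstract's description more concrete, below is a minimal NumPy sketch of the general recipe it outlines: a memory-light zeroth-order gradient estimate, a running diagonal curvature estimate used as a pre-conditioner, and layer-wise clipping of the pre-conditioned update. This is not the HELENE algorithm itself: the SPSA/MeZO-style two-point estimate stands in for the paper's annealed A-GNB gradients, the squared-gradient running average stands in for its diagonal Hessian estimator, and the clipping rule is a simplified illustration. All names (`zo_step`, `clip_thresh`, the toy loss) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(params):
    """Toy quadratic loss over a dict of per-layer parameter arrays."""
    return sum(float(np.sum(p ** 2)) for p in params.values())

def zo_grad(params, eps=1e-3):
    """Two-point zeroth-order gradient estimate (SPSA style):
    only forward passes are needed, so memory matches inference."""
    z = {k: rng.standard_normal(p.shape) for k, p in params.items()}
    plus  = {k: p + eps * z[k] for k, p in params.items()}
    minus = {k: p - eps * z[k] for k, p in params.items()}
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return {k: scale * z[k] for k in params}

def zo_step(params, hess_diag, lr=0.05, beta=0.9, clip_thresh=1.0, damping=1e-8):
    """One update: ZO gradient -> diagonal pre-conditioning -> layer-wise clip."""
    g = zo_grad(params)
    for k, p in params.items():
        # Running diagonal curvature proxy (placeholder for a true Hessian estimate).
        hess_diag[k] = beta * hess_diag[k] + (1 - beta) * g[k] ** 2
        update = g[k] / (np.sqrt(hess_diag[k]) + damping)
        # Layer-wise clipping: bound each layer's pre-conditioned update norm
        # separately, so one large layer cannot dominate the step size.
        norm = np.linalg.norm(update)
        if norm > clip_thresh:
            update *= clip_thresh / norm
        params[k] = p - lr * update
    return params, hess_diag

# Two "layers" of very different dimension, mimicking heterogeneous layer sizes.
params = {"layer1": rng.standard_normal(8), "layer2": rng.standard_normal(128)}
hess_diag = {k: np.zeros_like(p) for k, p in params.items()}
for step in range(200):
    params, hess_diag = zo_step(params, hess_diag)
print({k: round(float(np.sum(p ** 2)), 4) for k, p in params.items()})
```

Because the clipping is applied per layer, each layer's step is bounded independently of the total parameter count, which is the intuition behind the abstract's claim that convergence depends on layer dimensions rather than the full parameter-space dimension.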
Nov-15-2024