IteRABRe: Iterative Recovery-Aided Block Reduction
Wibowo, Haryo Akbarianto, Song, Haiyue, Tanaka, Hideki, Utiyama, Masao, Aji, Alham Fikri, Dabre, Raj
Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe is particularly strong at preserving linguistic capabilities, showing a 5% improvement over the baselines on language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating the preservation of multilingual capabilities.
arXiv.org Artificial Intelligence
Mar-8-2025
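The abstract describes an iterative prune-then-recover loop: repeatedly drop a transformer block, then run a brief recovery fine-tune on a small token budget (~2.5M tokens) before pruning again. The sketch below illustrates that general shape in PyTorch. It is only an assumption-laden illustration, not the paper's actual implementation: the block-selection criterion (input-output cosine similarity), the recovery hyperparameters, and the model interface (`model.blocks`, `model.embed`, `model.loss`) are all hypothetical placeholders.

```python
# Minimal sketch of iterative block reduction with recovery (illustrative only).
import torch
import torch.nn as nn


def block_importance(model, calib_batch):
    """Score each block by how little it changes its input.

    A block whose output is nearly identical to its input is assumed to be
    a candidate for removal. This cosine-similarity criterion is an
    assumption, not the criterion from the paper.
    """
    scores = []
    hidden = model.embed(calib_batch)          # hypothetical embedding call
    for block in model.blocks:                 # hypothetical nn.ModuleList of blocks
        out = block(hidden)
        sim = nn.functional.cosine_similarity(
            hidden.flatten(1), out.flatten(1), dim=-1
        ).mean()
        scores.append(sim.item())              # high similarity => low importance
        hidden = out
    return scores


def recover(model, recovery_loader, steps, lr=1e-5):
    """Brief recovery fine-tuning on a small token budget."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _, (inputs, targets) in zip(range(steps), recovery_loader):
        loss = model.loss(inputs, targets)      # hypothetical LM loss helper
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def iterative_block_reduction(model, calib_batch, recovery_loader,
                              target_num_blocks, recovery_steps=100):
    """Repeat: drop the least important block, then recover, until the
    target depth is reached."""
    while len(model.blocks) > target_num_blocks:
        scores = block_importance(model, calib_batch)
        drop_idx = max(range(len(scores)), key=scores.__getitem__)
        del model.blocks[drop_idx]              # remove the chosen block in place
        model = recover(model, recovery_loader, recovery_steps)
    return model
```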