Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression
Sander, Jacob, Moe, David, Cohen, Achraf, Venable, Brent, Dasari, Venkat, Jalaian, Brian
arXiv.org Artificial Intelligence
Modern foundational models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the MLP blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) fine-tuning with Cross-Entropy (L2PFT), which requires labeled data, versus (ii) self-distillation with KL-divergence (L2PSD), which uses only teacher logits and no labels. We evaluate both pipelines on the OLMo2-7B-SFT model for CommonsenseQA, a setup suited to the intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, KL-based distillation matches or exceeds CE fine-tuning in test accuracy, demonstrating that, even with a basic MLP-only pruning, the choice of loss function materially affects compressed model recovery in resource-constrained environments.
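The two recovery objectives and the pruning baseline named in the abstract can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the function names are hypothetical, the KL direction (teacher relative to student) is the conventional distillation choice and is assumed here, and ranking hidden units by the L2 norm of their weight rows is only one plausible reading of "layer-wise L2-norm pruning".

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(student_logits, label):
    """CE fine-tuning loss: needs the ground-truth label index."""
    return -log_softmax(student_logits)[label]


def kl_divergence(teacher_logits, student_logits):
    """Self-distillation loss KL(teacher || student): needs only teacher logits.

    Assumption: the direction and the absence of a temperature are
    illustrative choices, not details confirmed by the abstract.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def l2_prune_rows(weights, keep):
    """Keep the `keep` hidden units whose weight rows have the largest L2 norm.

    Assumption: one row per MLP hidden unit; returns the surviving rows
    (in original order) and their indices.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [weights[i] for i in kept], kept


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]   # logits from the unpruned model
    student = [1.5, 0.7, -0.5]   # logits from the pruned model
    print("CE (label 0):", cross_entropy(student, 0))
    print("KL(teacher || student):", kl_divergence(teacher, student))
    print("pruned:", l2_prune_rows([[3.0, 4.0], [0.0, 1.0], [1.0, 1.0]], keep=2))
```

The practical difference is visible in the signatures: `cross_entropy` takes a label, while `kl_divergence` takes only the logits of the unpruned model, which is what lets the distillation pipeline run without labeled data.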
May-27-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- Europe > Italy
- Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States
- Florida > Escambia County
- Pensacola (0.05)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Education (0.48)
- Technology: