Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

Neural Information Processing Systems 

Traditional knowledge distillation (KD) methods manually design student architectures to compress large models for a pre-specified computational budget. This requires several trials to find a viable student, and the process must be repeated whenever the computational budget changes. We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a single large model. Existing NAS methods train a single SuperLM consisting of millions of weight-sharing subnetworks, which results in interference between subnetworks of different sizes. Additionally, many of these works are task-specific, requiring task labels for SuperLM training.
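
To make the weight-sharing mechanism concrete, below is a minimal PyTorch-style sketch (an illustrative assumption, not the paper's implementation; the names SuperLinear and out_dim are hypothetical). Subnetworks of different widths are sliced out of one shared weight matrix, so gradients from differently sized students update the same parameters, which is the source of the interference noted above.

```python
# Minimal sketch (assumption, not the paper's code) of weight-sharing NAS:
# subnetworks of different widths reuse slices of one shared weight matrix.
import torch
import torch.nn as nn


class SuperLinear(nn.Module):
    """A linear layer whose subnetworks share the top-left weight slice."""

    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_dim: int) -> torch.Tensor:
        # Slice the shared parameters down to the sampled subnetwork width.
        w = self.weight[:out_dim, : x.shape[-1]]
        b = self.bias[:out_dim]
        return x @ w.t() + b


if __name__ == "__main__":
    layer = SuperLinear(max_in=768, max_out=768)
    x = torch.randn(4, 768)
    # Two sampled students of different widths share the same parameters,
    # so training one perturbs the other (the interference problem).
    small = layer(x, out_dim=256)
    large = layer(x, out_dim=768)
    print(small.shape, large.shape)  # torch.Size([4, 256]) torch.Size([4, 768])
```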