MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Neural Information Processing Systems
The small model (student) is trained by deeply mimicking the self-attention module of the large model (teacher), a component that plays a vital role in Transformer networks.
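To make the idea concrete, here is a minimal PyTorch sketch of the core transfer: the student's per-head self-attention distributions are pushed toward the teacher's with a KL-divergence loss. This is an illustration, not the paper's released code; in MiniLM the transfer targets only the teacher's last Transformer layer and is paired with a value-relation term, both of which this sketch omits, and the tensor shapes and function names below are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_probs(q, k):
    # Scaled dot-product attention distributions.
    # q, k: (batch, heads, seq_len, head_dim) -> (batch, heads, seq_len, seq_len)
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1)

def attention_distillation_loss(teacher_q, teacher_k, student_q, student_k):
    # KL(teacher || student) over attention distributions, averaged
    # across batch, heads, and query positions.
    t_attn = attention_probs(teacher_q, teacher_k)
    s_scores = student_q @ student_k.transpose(-1, -2) / (student_q.size(-1) ** 0.5)
    s_log_attn = F.log_softmax(s_scores, dim=-1)
    kl = F.kl_div(s_log_attn, t_attn, reduction="none").sum(-1)
    return kl.mean()

# Toy usage: head counts and sequence lengths must match so the
# seq_len x seq_len attention maps are directly comparable, but the
# per-head dimension can shrink (here 64 -> 32), which is what lets
# the student be smaller than the teacher.
batch, heads, seq_len = 2, 12, 16
loss = attention_distillation_loss(
    torch.randn(batch, heads, seq_len, 64),  # teacher queries
    torch.randn(batch, heads, seq_len, 64),  # teacher keys
    torch.randn(batch, heads, seq_len, 32),  # student queries
    torch.randn(batch, heads, seq_len, 32),  # student keys
)
print(loss.item())
```

Because the attention maps being matched are seq_len x seq_len regardless of hidden width, this loss places no constraint on the student's hidden size, which is what makes the approach suitable for task-agnostic compression.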