A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models
