MobileBERT Paper Summary

Sep-23-2020, 18:10:07 GMT–#artificialintelligence

As NLP models grow into the hundreds of billions of parameters, so does the importance of being able to create more compact representations of these models. Knowledge distillation has made this possible, but it is still treated as an afterthought when teacher models are designed. This probably reduces the effectiveness of the distillation, leaving potential performance improvements for the student on the table. Further, because small student models are difficult to fine-tune after the initial distillation without degrading their performance, the teacher must be both pre-trained and fine-tuned on the tasks we want the student to perform. Training a student through knowledge distillation therefore requires more total training than training the teacher alone, which limits the benefits of a student model to inference time.
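To make the term concrete, here is a minimal sketch of a generic soft-target distillation loss, in which the student is trained to match the teacher's temperature-softened output distribution. It illustrates knowledge distillation in general and is not taken from the MobileBERT paper, which may use different transfer objectives. The function names and the default temperature are assumptions chosen for this example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper for illustration; the T^2 factor is the common
    convention for keeping gradient magnitudes comparable across temperatures.
    """
    teacher_log_p = log_softmax(teacher_logits, temperature)
    student_log_q = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lp) * (lp - lq)
        for lp, lq in zip(teacher_log_p, student_log_q)
    )
    return temperature ** 2 * kl


# Illustrative logits for a three-class problem (made-up values).
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. Note that computing it requires a forward pass through the teacher for every training example, which is one reason distillation adds to the total training cost.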


