Question Answering with Python, HuggingFace Transformers and Machine Learning – MachineCurve
If you would like to read about DistilBERT in more detail, I'd suggest clicking here for the article. From what the abstract suggests, the model is 60% faster thanks to a 40% reduction in size, while retaining 97% of BERT's language understanding capabilities. This is a significant improvement and a great optimization with respect to traditional or 'vanilla' BERT. From the abstract:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster.
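To make this concrete, here is a minimal sketch of how a distilled model can be used for question answering with HuggingFace Transformers. It assumes the transformers library is installed and uses the distilbert-base-cased-distilled-squad checkpoint, a DistilBERT model fine-tuned on SQuAD; any other QA-capable checkpoint would work the same way.

```python
from transformers import pipeline

# Load a question-answering pipeline backed by a DistilBERT model
# fine-tuned on SQuAD (checkpoint choice is an assumption for illustration).
qa_pipeline = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

# Context paraphrasing the DistilBERT abstract quoted above.
context = (
    "DistilBERT is a smaller general-purpose language representation model "
    "obtained with knowledge distillation during pre-training. It is 40% "
    "smaller and 60% faster than BERT while retaining 97% of its language "
    "understanding capabilities."
)

result = qa_pipeline(
    question="How much smaller is DistilBERT than BERT?",
    context=context,
)

# The pipeline returns the extracted answer span plus a confidence score.
print(result["answer"], result["score"])
```

Because the pipeline extracts an answer span from the provided context, the quality of the answer depends entirely on whether the context actually contains the information asked for.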