BiT: Robustly Binarized Multi-distilled Transformer

Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, Yashar Mehdad

arXiv.org Artificial Intelligence 

Modern pre-trained transformers have rapidly advanced the state of the art in machine learning, but they have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarizing the weights and activations of the network can significantly alleviate these issues; however, it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. These approaches allow, for the first time, fully binarized transformer models that are at a practical level of accuracy, approaching the full-precision BERT baseline on the GLUE language understanding benchmark to within as little as 5.9%.
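The elastic binary activation mentioned in the abstract maps activations to two levels using learned parameters. Below is a minimal PyTorch sketch of one plausible form, assuming a learnable scale `alpha` and threshold `beta` with a straight-through estimator (STE) for gradients; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class ElasticBinaryActivation(nn.Module):
    """Hypothetical sketch: binarize activations to {0, alpha} with a
    learnable scale (alpha) and threshold (beta); gradients pass
    through a straight-through estimator."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(1))   # learnable threshold

    def forward(self, x):
        # Shift and scale, then clip to [0, 1].
        x_soft = torch.clamp((x - self.beta) / self.alpha, 0.0, 1.0)
        # Hard binarization to {0, 1}.
        x_hard = torch.round(x_soft)
        # STE trick: forward pass uses the binary value, backward pass
        # uses the gradient of the clipped soft value.
        x_ste = x_soft + (x_hard - x_soft).detach()
        return self.alpha * x_ste
```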
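The multi-distillation idea, successively distilling higher-precision models into lower-precision students, can be sketched as a simple loop. The `quantize` and `distill` callables and the bit schedule below are illustrative placeholders, not the paper's API.

```python
import copy

def multi_distill(fp_model, quantize, distill, bit_schedule=(8, 4, 2, 1)):
    """Hypothetical sketch of multi-distillation: each lower-precision
    student is initialized from, and distilled against, the previous
    higher-precision teacher. `quantize(model, bits)` reduces precision
    in place; `distill(teacher, student)` trains the student to match
    the teacher. Both are user-supplied placeholders."""
    teacher = fp_model
    for bits in bit_schedule:
        student = copy.deepcopy(teacher)  # initialize student from teacher
        quantize(student, bits)           # lower the student's precision
        distill(teacher, student)         # distill teacher into student
        teacher = student                 # the student becomes the next teacher
    return teacher                        # fully binarized model at bits == 1
```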
