Binarized Neural Machine Translation

Zhang, Yichi, Garg, Ankush, Cao, Yuan, Lew, Łukasz, Ghorbani, Behrooz, Zhang, Zhiru, Firat, Orhan

Feb-9-2023–arXiv.org Artificial Intelligence

The rapid scaling of language models is motivating research on low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residual connections to improve binarization quality. Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size. One-bit activations incur varying degrees of quality drop, but this drop is mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. The implementation in JAX/Flax will be open sourced.
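The inflated dot-product variance mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the paper's implementation (which is described as JAX/Flax); it is a plain-Python toy under stated assumptions: inputs are i.i.d. standard Gaussian, float weights use a 1/sqrt(d) scale, binarization is a plain sign function, and a division by sqrt(d) stands in for the normalization that the added LayerNorms are said to provide. The names `binarize` and `dot_product_variance` are hypothetical.

```python
import random
import statistics


def binarize(x):
    """Map a real value to +1.0 or -1.0 (zero maps to +1.0)."""
    return 1.0 if x >= 0 else -1.0


def dot_product_variance(d, binary, n_samples=2000, seed=0):
    """Empirical variance of a length-d dot product over random vector pairs."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        # Assumption: activations are i.i.d. standard Gaussian.
        a = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Assumption: float weights have std 1/sqrt(d), a common init scale.
        w = [rng.gauss(0.0, d ** -0.5) for _ in range(d)]
        if binary:
            # One-bit weights and activations: every entry becomes +1 or -1.
            a = [binarize(v) for v in a]
            w = [binarize(v) for v in w]
        dots.append(sum(x * y for x, y in zip(a, w)))
    return statistics.pvariance(dots)


d = 256
var_float = dot_product_variance(d, binary=False)
var_binary = dot_product_variance(d, binary=True)
# Dividing each dot product by sqrt(d) divides its variance by d.
var_rescaled = var_binary / d

print(f"float variance:            {var_float:.3f}")
print(f"binary variance:           {var_binary:.1f}")
print(f"binary variance, rescaled: {var_rescaled:.3f}")
```

Under these assumptions each binarized product is +1 or -1, so the sum of d such terms has variance close to d, whereas the scaled float dot product stays close to 1. Rescaling the binarized output brings it back to roughly unit variance, which is the kind of correction a normalization layer can supply.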

artificial intelligence, machine learning, natural language

Country:
- North America > United States
  - Pennsylvania > Philadelphia County
    - Philadelphia (0.04)
  - Massachusetts > Suffolk County
    - Boston (0.04)
- Europe
  - Spain (0.04)
  - Poland (0.04)
  - Belgium (0.04)
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)