Bridging the Gap for Tokenizer-Free Language Models
Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, Noah Constant
– arXiv.org Artificial Intelligence
Purely character-based language models (LMs) have been lagging in quality on large-scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the model is essential to achieving competitive results. In this paper, we show that contrary to this conventional wisdom, tokenizer-free LMs with sufficient capacity can achieve competitive performance on a large-scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve a new state of the art for tokenizer-free LMs, pushing these models to be on par with their word-based counterparts.
Aug-27-2019
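To make the tokenizer-free setup concrete, below is a minimal character-level (byte-level) transformer LM sketch in PyTorch. It is not the authors' lm1b model (which is a 40-layer vanilla transformer); the layer sizes, the class name `CharTransformerLM`, and the training snippet are illustrative assumptions, intended only to show how a fixed 256-entry byte vocabulary stands in for a learned word tokenizer.

```python
# Minimal character/byte-level ("tokenizer-free") transformer LM sketch.
# NOTE: illustrative assumptions only -- hyperparameters are far smaller
# than the 40-layer model described in the abstract.
import torch
import torch.nn as nn

class CharTransformerLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=256, nhead=4,
                 num_layers=4, dim_feedforward=1024, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # one embedding per byte value
        self.pos = nn.Embedding(max_len, d_model)        # learned position embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=dim_feedforward, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)        # next-byte logits

    def forward(self, x):
        # x: (batch, seq) of byte ids in [0, 255]
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        h = self.embed(x) + self.pos(positions)
        # Causal mask: each position may attend only to earlier characters.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        h = self.encoder(h, mask=mask)
        return self.out(h)                               # (batch, seq, vocab)

# "Tokenization" is just taking raw UTF-8 bytes -- no learned vocabulary.
text = "Tokenizer-free language modeling."
ids = torch.tensor([list(text.encode("utf-8"))])         # (1, seq)
model = CharTransformerLM()
logits = model(ids[:, :-1])                               # predict the next byte
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 256), ids[:, 1:].reshape(-1))
print(loss.item())
```

The design point the sketch illustrates is that the input pipeline needs no tokenizer at all: the vocabulary is the fixed set of byte values, and all prior knowledge about word structure must be learned by the network itself, which is why the paper argues model capacity (depth) is what closes the gap.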