Bridging the Gap for Tokenizer-Free Language Models
Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, Noah Constant
– arXiv.org Artificial Intelligence
Purely character-based language models (LMs) have been lagging in quality on large-scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the model is essential to achieving competitive results. In this paper, we show that contrary to this conventional wisdom, tokenizer-free LMs with sufficient capacity can achieve competitive performance on a large-scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve a new state of the art for tokenizer-free LMs, pushing these models to be on par with their word-based counterparts.
Aug-27-2019
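To make the tokenizer-free setup concrete, below is a minimal character-level (byte-level) transformer LM sketch in PyTorch. It is not the authors' lm1b model (which is a 40-layer vanilla transformer); the layer sizes, the class name `CharTransformerLM`, and the training snippet are illustrative assumptions, intended only to show how a fixed 256-entry byte vocabulary stands in for a learned word tokenizer.

```python
# Minimal character/byte-level ("tokenizer-free") transformer LM sketch.
# NOTE: illustrative assumptions only -- hyperparameters are far smaller
# than the 40-layer model described in the abstract.
import torch
import torch.nn as nn

class CharTransformerLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=256, nhead=4,
                 num_layers=4, dim_feedforward=1024, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # one embedding per byte value
        self.pos = nn.Embedding(max_len, d_model)        # learned position embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=dim_feedforward, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)        # next-byte logits

    def forward(self, x):
        # x: (batch, seq) of byte ids in [0, 255]
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        h = self.embed(x) + self.pos(positions)
        # Causal mask: each position may attend only to earlier characters.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        h = self.encoder(h, mask=mask)
        return self.out(h)                               # (batch, seq, vocab)

# "Tokenization" is just taking raw UTF-8 bytes -- no learned vocabulary.
text = "Tokenizer-free language modeling."
ids = torch.tensor([list(text.encode("utf-8"))])         # (1, seq)
model = CharTransformerLM()
logits = model(ids[:, :-1])                               # predict the next byte
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 256), ids[:, 1:].reshape(-1))
print(loss.item())
```

The design point the sketch illustrates is that the input pipeline needs no tokenizer at all: the vocabulary is the fixed set of byte values, and all prior knowledge about word structure must be learned by the network itself, which is why the paper argues model capacity (depth) is what closes the gap.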