MorphPiece: Moving away from Statistical Language Representation
arXiv.org Artificial Intelligence
Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, with little consideration of linguistic features. We propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows superior convergence compared to the same architecture trained on a standard BPE tokenizer. Specifically, we get language modeling performance comparable to that of a model six times larger. Additionally, we evaluate MorphGPT on a variety of NLP tasks in supervised and unsupervised settings and find superior performance across the board compared to the GPT-2 model.
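As a rough illustration of what "based partly on morphological segmentation" could look like in practice, the toy sketch below splits words found in a morpheme table into morphemes and hands every other word to a statistical fallback. Everything in it is an assumption made for illustration: the table `MORPH_TABLE`, its entries, the function names, and the fixed-width `fallback_split` (a stand-in for a trained subword tokenizer such as BPE) are hypothetical and are not taken from the paper.

```python
# Toy sketch only: a hypothetical morpheme table, not the paper's lexicon.
MORPH_TABLE = {
    "unbreakable": ["un", "break", "able"],
    "tokenizers": ["token", "ize", "er", "s"],
}


def fallback_split(word, size=3):
    """Stand-in for a statistical subword tokenizer: fixed-width chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def tokenize(text):
    """Use the morpheme table when a word is listed, else the fallback."""
    tokens = []
    for word in text.lower().split():
        if word in MORPH_TABLE:
            tokens.extend(MORPH_TABLE[word])
        else:
            tokens.extend(fallback_split(word))
    return tokens


if __name__ == "__main__":
    print(tokenize("Unbreakable tokenizers work"))
```

In this sketch, only words present in the table receive linguistically meaningful pieces; the size and coverage of such a table would determine how often the statistical fallback is used.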
Jul-14-2023
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States
      - Washington > King County
        - Seattle (0.04)
      - New York > New York County
        - New York City (0.04)
      - Massachusetts > Suffolk County
        - Boston (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - California > San Diego County
        - San Diego (0.04)
  - Europe
    - Sweden
      - Vaestra Goetaland > Gothenburg (0.04)
      - Uppsala County > Uppsala (0.04)
    - Germany
      - Berlin (0.04)
      - Hesse > Darmstadt Region
        - Darmstadt (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
- Genre:
  - Research Report (0.64)
  - Instructional Material (0.46)
- Technology: