MorphPiece : Moving away from Statistical Language Representation

Jul-14-2023–arXiv.org Artificial Intelligence

Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration to the linguistic features. We propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows superior convergence compared to the same architecture trained on a standard BPE tokenizer. Specifically we get Language Modeling performance comparable to a 6 times larger model. Additionally, we evaluate MorphGPT on a variety of NLP tasks in supervised and unsupervised settings and find superior performance across the board, compared to GPT-2 model.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Jul-14-2023

arXiv.org PDF

Add feedback

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - New York > New York County
      - New York City (0.04)
    - Massachusetts > Suffolk County
      - Boston (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - California > San Diego County
      - San Diego (0.04)
- Europe
  - Sweden
    - Vaestra Goetaland > Gothenburg (0.04)
    - Uppsala County > Uppsala (0.04)
  - Germany
    - Berlin (0.04)
    - Hesse > Darmstadt Region
      - Darmstadt (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)

Genre:
- Research Report (0.64)
- Instructional Material (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.95)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found