Distributional Properties of Subword Regularization

Marco Cognetta, Vilém Zouhar, Naoaki Okazaki

Aug-21-2024 – arXiv.org Artificial Intelligence

Subword regularization, widely used in NLP, improves model performance by reducing dependence on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, the distributions over tokenizations that these variants induce have not been analyzed. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as described above, we hypothesize that this bias artificially limits the effectiveness of these schemes. We therefore propose an algorithm for uniformly sampling tokenizations, use it as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
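To make the idea of sampling tokenizations uniformly concrete, here is a minimal sketch in Python. It is not the authors' algorithm, whose details are not given in this abstract; it assumes a tokenization is any segmentation of a word into in-vocabulary pieces, and it uses a simple dynamic program that counts segmentations of each suffix. The function names and the toy vocabulary are hypothetical.

```python
import random


def count_suffix_tokenizations(word, vocab):
    """counts[i] = number of ways to split word[i:] into tokens from vocab."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                counts[i] += counts[j]
    return counts


def sample_uniform_tokenization(word, vocab, rng=random):
    """Draw one segmentation of `word` uniformly at random.

    Assumption (illustrative only): every segmentation into vocabulary
    tokens counts as a valid tokenization.
    """
    counts = count_suffix_tokenizations(word, vocab)
    if counts[0] == 0:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    tokens, i = [], 0
    while i < len(word):
        # Pick the next token word[i:j] with probability counts[j] / counts[i],
        # which makes every complete segmentation equally likely.
        r = rng.randrange(counts[i])
        for j in range(i + 1, len(word) + 1):
            if word[i:j] in vocab:
                if r < counts[j]:
                    tokens.append(word[i:j])
                    i = j
                    break
                r -= counts[j]
    return tokens


# Toy vocabulary (hypothetical), chosen so that "abc" has several segmentations.
toy_vocab = {"a", "b", "c", "ab", "bc", "abc"}
print(sample_uniform_tokenization("abc", toy_vocab, random.Random(0)))
```

Because each step chooses the next token in proportion to the number of ways the remaining suffix can be completed, no segmentation is favored over another. This contrasts with the dropout-style variants discussed in the abstract, which are reported to concentrate on a small set of tokenizations per word.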
