Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages
Downey, C. M., Blevins, Terra, Goldfine, Nora, Steinert-Threlkeld, Shane
arXiv.org Artificial Intelligence
For languages other than English and a handful of other very high-resource languages, pre-trained multilingual language models form the backbone of most current NLP systems. These models address the relative data scarcity in most non-English languages by pooling text data across many languages to train a single model that (in theory) covers all training languages (Devlin, 2019; Conneau and Lample, 2019; Conneau et al., 2020; Liu et al., 2020; Scao et al., 2023, i.a.). Additionally, the information-theoretic tokenization modules for cross-lingual models are usually under-optimized for any given language, and especially low-resource languages (Ács, 2019; Conneau and Lample, 2019, i.a.). For this reason, we propose several simple techniques to replace the large cross-lingual vocabulary of a pre-trained model with a compact, language-specific one during model specialization. Training a new SentencePiece or BPE tokenizer poses no …
Oct-26-2023
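The abstract mentions training a new BPE tokenizer to obtain a compact, language-specific vocabulary. As a rough illustration of what BPE vocabulary training involves (not the authors' implementation, which would use standard SentencePiece/BPE tooling), here is a minimal stdlib sketch of the BPE merge-learning loop on a toy corpus:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-tokenized corpus.

    Each word starts as a sequence of characters plus an end-of-word
    marker; training repeatedly merges the most frequent adjacent
    symbol pair, growing the subword vocabulary one merge at a time.
    """
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

merges = train_bpe("low low low lower lowest", 3)
# First merges fuse the frequent 'l'+'o' and 'lo'+'w' pairs.
```

A tokenizer trained this way on monolingual text yields far fewer, longer subwords for that language than a shared cross-lingual vocabulary of the same size, which is the motivation for the vocabulary-replacement techniques the abstract proposes.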