Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

Downey, C. M., Blevins, Terra, Goldfine, Nora, Steinert-Threlkeld, Shane

arXiv.org Artificial Intelligence 

For languages other than English and a handful of other very high-resource languages, pre-trained multilingual language models form the backbone of most current NLP systems. These models address the relative data scarcity in most non-English languages by pooling text data across many languages to train a single model that (in theory) covers all training languages (Devlin, 2019; Conneau and Lample, 2019; Conneau et al., 2020; Liu et al., 2020; Scao et al., 2023, i.a.). Additionally, the tokenization modules of cross-lingual models are usually information-theoretically under-optimized for any given language, and especially for low-resource languages (Ács, 2019; Conneau and Lample, 2019, i.a.). For this reason, we propose several simple techniques to replace the large cross-lingual vocabulary of a pre-trained model with a compact, language-specific one during model specialization. Training a new SentencePiece or BPE tokenizer poses no …
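
To make the proposal concrete, the sketch below (Python, using the sentencepiece and transformers libraries) shows one way such a vocabulary replacement can look: train a new SentencePiece model on a target-language corpus, then build a compact embedding matrix for it from the pre-trained multilingual one. The corpus path, vocabulary size, base checkpoint (xlm-roberta-base), and the overlap-copy / mean-initialization heuristic are illustrative assumptions, not the paper's exact procedure; the paper compares several embedding re-initialization strategies rather than prescribing this one.

import sentencepiece as spm
from transformers import AutoModelForMaskedLM, AutoTokenizer

# 1) Train a compact, language-specific SentencePiece vocabulary.
#    The corpus path and vocabulary size are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="target_lang_corpus.txt",   # monolingual text in the target language
    model_prefix="target_spm",
    vocab_size=32000,
    model_type="unigram",             # "bpe" works the same way
)
sp = spm.SentencePieceProcessor(model_file="target_spm.model")

# 2) Load a pre-trained multilingual model and its large cross-lingual vocabulary.
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
old_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
old_emb = model.get_input_embeddings().weight.data         # [old_vocab_size, hidden]

# 3) Build a compact embedding matrix for the new vocabulary: copy the old row
#    when a piece already exists in the multilingual vocabulary, otherwise fall
#    back to the mean of the old embeddings. (Special-token handling omitted.)
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
new_emb = old_emb.mean(dim=0).repeat(len(new_pieces), 1)    # [new_vocab_size, hidden]
old_vocab = old_tokenizer.get_vocab()
for new_id, piece in enumerate(new_pieces):
    old_id = old_vocab.get(piece)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]

# 4) Swap in the compact embedding layer; the Transformer body is reused and can
#    then be adapted further (e.g., continued pre-training) on the target language.
model.resize_token_embeddings(len(new_pieces))
model.get_input_embeddings().weight.data.copy_(new_emb)

Copying rows for overlapping pieces preserves whatever the shared subwords already encode, while the mean fallback keeps newly added rows in the same region of embedding space as the pre-trained ones.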
