Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages
Downey, C. M., Blevins, Terra, Goldfine, Nora, Steinert-Threlkeld, Shane
arXiv.org Artificial Intelligence
For languages other than English and a handful of other very high-resource languages, pre-trained multilingual language models form the backbone of most current NLP systems. These models address the relative data scarcity in most non-English languages by pooling text data across many languages to train a single model that (in theory) covers all training languages (Devlin, 2019; Conneau and Lample, 2019; Conneau et al., 2020; Liu et al., 2020; Scao et al., 2023, i.a.). Additionally, the information-theoretic tokenization modules for cross-lingual models are usually under-optimized for any given language, and especially low-resource languages (Ács, 2019; Conneau and Lample, 2019, i.a.). For this reason, we propose several simple techniques to replace the large cross-lingual vocabulary of a pre-trained model with a compact, language-specific one during model specialization. Training a new SentencePiece or BPE tokenizer poses no …
Oct-26-2023
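The abstract mentions training a new BPE tokenizer to obtain a compact, language-specific vocabulary. As a rough illustration of what BPE vocabulary training involves (not the authors' implementation, which would use standard SentencePiece/BPE tooling), here is a minimal stdlib sketch of the BPE merge-learning loop on a toy corpus:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-tokenized corpus.

    Each word starts as a sequence of characters plus an end-of-word
    marker; training repeatedly merges the most frequent adjacent
    symbol pair, growing the subword vocabulary one merge at a time.
    """
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

merges = train_bpe("low low low lower lowest", 3)
# First merges fuse the frequent 'l'+'o' and 'lo'+'w' pairs.
```

A tokenizer trained this way on monolingual text yields far fewer, longer subwords for that language than a shared cross-lingual vocabulary of the same size, which is the motivation for the vocabulary-replacement techniques the abstract proposes.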