OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Yihong Liu, Peiqin Lin, Mingyang Wang, Hinrich Schütze
arXiv.org Artificial Intelligence
Pretraining multilingual language models from scratch requires considerable computational resources and substantial training data. A more efficient approach is therefore to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this approach usually initializes the embeddings of new subwords randomly and adds substantially more embedding parameters to the language model, which weakens its efficiency. To address these issues, we propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords from target languages and can thus adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual word embeddings and injects the alignment knowledge into the new embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which significantly reduces the number of parameters without sacrificing performance. Through extensive experiments, we show that models initialized by OFA are efficient and outperform several baselines. OFA not only accelerates the convergence of continued pretraining, which is friendly to a limited computation budget, but also improves zero-shot crosslingual transfer on a wide range of downstream tasks. We make our code and models publicly available.
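The two ideas in the abstract, factorizing the embedding matrix into two lower-dimensional matrices and initializing unseen subwords from external aligned embeddings, can be illustrated with a small NumPy sketch. The abstract does not specify the factorization method or the weighting scheme, so everything below is an assumption made for illustration: truncated SVD as the factorization, a softmax-weighted average over the k most similar source subwords as the initializer, and the hypothetical names `factorize_embeddings`, `init_unseen`, `k`, and `temperature`. It should not be read as the released OFA implementation.

```python
import numpy as np


def factorize_embeddings(E, d):
    """Approximate E (V x D) as F (V x d) @ P (d x D) via truncated SVD.

    Assumption: SVD stands in for the unspecified "matrix factorization".
    """
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    F = U[:, :d] * S[:d]   # lower-dimensional subword coordinates
    P = Vt[:d, :]          # shared projection back up to dimension D
    return F, P


def init_unseen(F_src, ext_src, ext_new, k=2, temperature=0.1):
    """Initialize low-dimensional embeddings for unseen subwords.

    Each new subword gets a softmax-weighted average of the k source
    subwords closest to it (cosine similarity) in an external, aligned
    embedding space. The weighting scheme is illustrative only.
    """
    src = ext_src / np.linalg.norm(ext_src, axis=1, keepdims=True)
    new = ext_new / np.linalg.norm(ext_new, axis=1, keepdims=True)
    sims = new @ src.T                                # (N_new, V_src)
    top = np.argsort(-sims, axis=1)[:, :k]            # k nearest sources
    top_sims = np.take_along_axis(sims, top, axis=1)  # (N_new, k)
    weights = np.exp(top_sims / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    # F_src[top] has shape (N_new, k, d); average over the k neighbours.
    return np.einsum("nk,nkd->nd", weights, F_src[top])


# Toy usage with random data (not real model weights).
rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))        # source embedding matrix, V=50, D=16
F, P = factorize_embeddings(E, d=4)
ext_src = rng.normal(size=(50, 8))   # external vectors for source subwords
ext_new = rng.normal(size=(5, 8))    # external vectors for unseen subwords
F_new = init_unseen(F, ext_src, ext_new)
full_embeddings_new = F_new @ P      # project back to the model dimension
```

In this toy setting the factorized form stores V*d + d*D values instead of V*D, which is where the parameter saving comes from when d is much smaller than D; only the V x d matrix grows as the vocabulary is extended.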
Nov-15-2023
- Genre:
- Research Report (0.64)
- Technology: