HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
Özeren, Enes, Liu, Yihong, Schütze, Hinrich
–arXiv.org Artificial Intelligence
Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLMs token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pretraining. Experiments demonstrate that HYPEROFA consistently outperforms random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
arXiv.org Artificial Intelligence
Jul-15-2025
- Country:
- Africa > Ethiopia
- Addis Ababa > Addis Ababa (0.04)
- Asia
- China (0.04)
- Japan > Honshū
- Kansai > Kyoto Prefecture > Kyoto (0.04)
- Macao (0.04)
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.14)
- Singapore (0.04)
- South Korea (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- France (0.04)
- Germany
- Bavaria > Upper Bavaria
- Munich (0.04)
- Berlin (0.04)
- Bavaria > Upper Bavaria
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Belgium > Brussels-Capital Region
- North America
- Canada > Ontario
- Toronto (0.04)
- Dominican Republic (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.14)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Washington > King County
- Seattle (0.04)
- Florida > Miami-Dade County
- Canada > Ontario
- Oceania > Australia
- South America > Colombia
- Meta Department > Villavicencio (0.04)
- Africa > Ethiopia
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Health & Medicine (0.46)
- Technology: