GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge
Gurgurov, Daniil, Kumar, Rishu, Ostermann, Simon
–arXiv.org Artificial Intelligence
Contextualized embeddings based on large language models (LLMs) are available for various languages, but their coverage is often limited for lower resourced languages. Using LLMs for such languages is often difficult due to a high computational cost; not only during training, but also during inference. Static word embeddings are much more resource-efficient ("green"), and thus still provide value, particularly for very low-resource languages. There is, however, a notable lack of comprehensive repositories with such embeddings for diverse languages. To address this gap, we present GrEmLIn, a centralized repository of green, static baseline embeddings for 87 mid- and low-resource languages. We compute GrEmLIn embeddings with a novel method that enhances GloVe embeddings by integrating multilingual graph knowledge, which makes our static embeddings competitive with LLM representations, while being parameter-free at inference time. Our experiments demonstrate that GrEmLIn embeddings outperform state-of-the-art contextualized embeddings from E5 on the task of lexical similarity. They remain competitive in extrinsic evaluation tasks like sentiment analysis and natural language inference, with average performance gaps of just 5-10\% or less compared to state-of-the-art models, given a sufficient vocabulary overlap with the target task, and underperform only on topic classification. Our code and embeddings are publicly available at https://huggingface.co/DFKI.
arXiv.org Artificial Intelligence
Dec-11-2024
- Country:
- Africa > Eswatini
- Asia
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Sweden > Östergötland County
- Linköping (0.04)
- Lithuania > Kaunas County
- Kaunas (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Italy > Tuscany
- Florence (0.04)
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Germany > Berlin (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Bulgaria (0.04)
- Poland > Greater Poland Province
- Poznań (0.04)
- Belgium > Brussels-Capital Region
- North America
- Canada
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Ontario > Toronto (0.04)
- British Columbia > Metro Vancouver Regional District
- United States
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- New Mexico (0.04)
- Oregon (0.04)
- Louisiana > Orleans Parish
- Canada
- Genre:
- Research Report
- New Finding (0.67)
- Promising Solution (0.68)
- Research Report
- Technology: