Adapting Multilingual Embedding Models to Historical Luxembourgish
Michail, Andrianos, Raclé, Corina Julia, Opitz, Juri, Clematide, Simon
–arXiv.org Artificial Intelligence
The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that these models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98\% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.
arXiv.org Artificial Intelligence
Feb-11-2025
- Country:
- Asia > Middle East
- UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe
- North America > United States (0.14)
- Asia > Middle East
- Genre:
- Research Report (0.50)
- Technology: