Adapting Multilingual Embedding Models to Historical Luxembourgish

Michail, Andrianos, Raclé, Corina Julia, Opitz, Juri, Clematide, Simon

Feb-11-2025–arXiv.org Artificial Intelligence

The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that these models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98\% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Feb-11-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Florida > Miami-Dade County > Miami (0.04)
- Europe
  - Switzerland > Zürich
    - Zürich (0.05)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
- Asia > Middle East
  - UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (0.70)
    - Machine Translation (0.47)
  - Machine Learning > Neural Networks
    - Deep Learning (0.36)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found