Neural Text Normalization for Luxembourgish using Real-Life Variation Data
Lutgen, Anne-Marie; Plum, Alistair; Purschke, Christoph; Plank, Barbara
arXiv.org Artificial Intelligence
Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
Dec-13-2024