Neural Text Normalization for Luxembourgish using Real-Life Variation Data
Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank
arXiv.org Artificial Intelligence
Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
Dec-13-2024