Neural Text Normalization for Luxembourgish using Real-Life Variation Data

Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank

Dec-13-2024, arXiv.org Artificial Intelligence

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically motivated evaluation to test byte-based, word-based, and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
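
As a rough illustration of how word-level variation data could yield sentence-level training pairs for a sequence-to-sequence normalizer, the sketch below swaps standard words for observed variants to build (source, target) pairs. It is not the authors' pipeline: the lexicon entries, `make_pair`, `byte_length`, and `replace_prob` are hypothetical names and values chosen for illustration. The byte count at the end shows why a byte-based model sees longer sequences than the character count suggests.

```python
import random

# Hypothetical word-level variation lexicon: standard form -> observed variants.
# These entries are illustrative only and are NOT taken from the paper's data.
VARIANTS = {
    "Lëtzebuerg": ["Letzebuerg", "Lëtzebuerch"],
    "schéin": ["schein", "scheng"],
}

def make_pair(standard_sentence, variants, rng, replace_prob=1.0):
    """Return (source, target), where source has standard words swapped for variants.

    The target is the unchanged standard sentence, so a seq2seq model trained
    on such pairs would learn to map variant spellings back to the standard.
    """
    source_tokens = []
    for token in standard_sentence.split():
        options = variants.get(token)
        if options and rng.random() < replace_prob:
            source_tokens.append(rng.choice(options))
        else:
            source_tokens.append(token)
    return " ".join(source_tokens), standard_sentence

def byte_length(text):
    """Number of UTF-8 bytes, i.e. the input length a byte-level model processes."""
    return len(text.encode("utf-8"))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
source, target = make_pair("Lëtzebuerg ass schéin", VARIANTS, rng)
print(source, "->", target)
print(byte_length(target), "bytes for", len(target), "characters")
```

Setting `replace_prob` below 1.0 would leave some words in their standard form, which is one possible way to mix noisy and clean tokens within a sentence; whether the paper does anything comparable is not stated in the abstract.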

Country:
- North America > United States
  - Oregon (0.04)
  - New Mexico > Santa Fe County
    - Santa Fe (0.04)
- Europe
  - Spain (0.04)
  - Bulgaria > Varna Province
    - Varna (0.04)
  - Iceland > Capital Region
    - Reykjavik (0.04)
  - Middle East > Malta
    - Port Region > Southern Harbour District > Valletta (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Germany
    - Berlin (0.04)
    - Bavaria > Upper Bavaria
      - Munich (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - United Kingdom > England
    - Oxfordshire > Oxford (0.04)
  - Sweden > Västra Götaland
    - Gothenburg (0.04)
- Asia
  - Singapore (0.04)
  - Middle East > Iraq
    - Babil Governorate > Hillah (0.04)
- Africa > Middle East
  - Morocco (0.04)

Genre:
- Research Report (0.40)

Industry:
- Education (0.68)

Technology:
- Information Technology
  - Communications (1.00)
  - Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language > Text Processing (0.93)
