Modeling Orthographic Variation in Occitan's Dialects
Hopton, Zachary William, Aepli, Noëmi
arXiv.org Artificial Intelligence
Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
Apr-30-2024
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States
      - New Mexico > Santa Fe County
        - Santa Fe (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Michigan > Washtenaw County
        - Ann Arbor (0.04)
  - Europe
    - Italy (0.04)
    - Germany > Berlin (0.04)
    - Switzerland > Zürich
      - Zürich (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
  - Asia
- Genre:
  - Research Report > New Finding (1.00)
- Technology: