Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation
arXiv.org Artificial Intelligence
Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods: dedicated data-to-text models trained from scratch, and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for two agglutinative languages (isiXhosa and Finnish). We also investigate pretrained solutions for T2X and find that standard PLMs fall short. Fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.
Mar-12-2024
- Country:
- Oceania > Australia
- North America
- United States
- Washington > King County
- Seattle (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- California
- San Francisco County > San Francisco (0.14)
- San Diego County > San Diego (0.04)
- Puerto Rico > San Juan
- San Juan (0.04)
- Canada
- Ontario > Toronto (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Europe
- France (0.04)
- Spain (0.04)
- Czechia > Prague (0.04)
- Italy > Tuscany
- Florence (0.04)
- Germany
- Berlin (0.04)
- Saarland > Saarbrücken (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Finland > Southwest Finland
- Turku (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Sweden > Östergötland County
- Linköping (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- China > Hong Kong (0.04)
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.04)
- Japan > Honshū
- Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Africa
- Ethiopia (0.05)
- Niger (0.04)
- South Africa > Western Cape
- Cape Town (0.05)
- Genre:
- Research Report > New Finding (0.48)
- Technology: