Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation

Mar-12-2024–arXiv.org Artificial Intelligence

Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data, enabling future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods: dedicated data-to-text models trained from scratch, and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for two agglutinative languages (isiXhosa and Finnish). Our investigation of pretrained solutions for T2X reveals that standard PLMs fall short; fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.
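The abstract does not spell out how SSPG copies entities, so the following is only a minimal sketch of the generic pointer-generator mixture that copy-based models commonly use, not the authors' architecture. It blends a vocabulary distribution with attention weights over source tokens, so that entity names in the input triples can receive probability even when they are out of vocabulary. The function name, the toy tokens, and all probability values are hypothetical, and the subword segmentation component of SSPG is not modelled here.

```python
from collections import defaultdict


def pointer_generator_mix(p_gen, vocab_probs, attention, source_tokens):
    """Blend a generation distribution with a copy distribution.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_probs   -- dict mapping vocabulary token -> generation probability
    attention     -- attention weights over source positions (sum to 1)
    source_tokens -- source tokens aligned one-to-one with `attention`
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have equal length")

    final = defaultdict(float)
    # Generation path: scale every vocabulary probability by p_gen.
    for token, prob in vocab_probs.items():
        final[token] += p_gen * prob
    # Copy path: route the remaining mass to source tokens via attention.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Toy example (all values are illustrative assumptions, not from the paper).
vocab_probs = {"the": 0.5, "is": 0.3, "<unk>": 0.2}
source_tokens = ["Cape_Town", "country", "South_Africa"]
attention = [0.7, 0.2, 0.1]

mixed = pointer_generator_mix(0.4, vocab_probs, attention, source_tokens)
```

In this toy setting the out-of-vocabulary entity `Cape_Town` receives probability only through the copy path, which is the behaviour that makes copying attractive for entity-heavy inputs such as WebNLG-style triples.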


Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - United States
    - Washington > King County
      - Seattle (0.04)
    - Pennsylvania > Philadelphia County
      - Philadelphia (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - California
      - San Francisco County > San Francisco (0.14)
      - San Diego County > San Diego (0.04)
  - Puerto Rico > San Juan
    - San Juan (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - France (0.04)
  - Spain (0.04)
  - Czechia > Prague (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Germany
    - Berlin (0.04)
    - Saarland > Saarbrücken (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - Finland > Southwest Finland
    - Turku (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Sweden > Östergötland County
    - Linköping (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Africa
  - Ethiopia (0.05)
  - Niger (0.04)
  - South Africa > Western Cape
    - Cape Town (0.05)

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)
