gaHealth: An English-Irish Bilingual Corpus of Health Data

Lankford, Séamus, Afli, Haithem, Loinsigh, Órla Ní, Way, Andy

Mar-6-2024–arXiv.org Artificial Intelligence

Machine Translation is a mature technology for many high-resource language pairs. However, in the context of low-resource languages, there is a paucity of parallel datasets available for developing translation models. Furthermore, the development of datasets for low-resource languages often focuses on simply creating the largest possible dataset for generic translation, so the benefits of developing smaller in-domain datasets can easily be overlooked. To assess the merits of using in-domain data, a dataset for the specific domain of health was developed for the low-resource English-to-Irish language pair. Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for the health domain. In the context of translating health-related data, models developed using the gaHealth corpus demonstrated a maximum BLEU score improvement of 22.2 points (40%) when compared with top-performing models from the LoResMT2021 Shared Task. Furthermore, we define linguistic guidelines for developing gaHealth, the first bilingual corpus of health data for the Irish language, which we hope will be of use to other creators of low-resource datasets.

corpus, dataset, translation

Country:
- North America > United States
  - Pennsylvania > Philadelphia County > Philadelphia (0.04)
- Europe
  - Czechia > Prague (0.04)
  - Bulgaria (0.04)
  - Ireland
    - Munster > County Cork
      - Cork (0.04)
    - Leinster > County Dublin
      - Dublin (0.04)
  - Finland > Uusimaa
    - Helsinki (0.04)

Genre:
- Research Report (0.64)

Industry:
- Government (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)
