BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context

Alexis Matzopoulos, Charl Hendriks, Hishaam Mahomed, Francois Meyer

Jan-7-2025–arXiv.org Artificial Intelligence

The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development (<100M). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora contain far fewer than 100M words. In this paper, we explore the potential of BabyLMs for low-resource languages, using isiXhosa as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. Both outperform a vanilla pretrained model on POS tagging and NER, with notable gains (+3.2 F1) on the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance of, and the lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

computational linguistics, large language model, machine learning


