From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery
arXiv.org Artificial Intelligence
Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
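To illustrate the idea of the conversion pipeline, here is a minimal sketch of turning orthographic text into a continuous stream of phonemes. The tiny IPA lexicon and the `to_phoneme_stream` helper are hypothetical stand-ins for illustration; the paper's actual pipeline would rely on a full grapheme-to-phoneme tool and handle out-of-vocabulary words.

```python
# Minimal sketch: look up each word in a small pronunciation dictionary (IPA)
# and concatenate the results into one continuous phoneme stream with no
# word boundaries. The lexicon here is illustrative, not the paper's pipeline.

LEXICON = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}

def to_phoneme_stream(text):
    """Convert whitespace-tokenized text to one flat list of phonemes."""
    stream = []
    for word in text.lower().split():
        stream.extend(LEXICON[word])  # this sketch fails on unknown words
    return stream

print(to_phoneme_stream("The cat sat"))
# → ['ð', 'ə', 'k', 'æ', 't', 's', 'æ', 't']
```

A model pre-trained on such streams receives phonemic rather than orthographic input, which is what makes comparison against standard (orthographic) benchmarks non-trivial.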
Oct-30-2024