Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Massimo Daul, Alessio Tosolini, Claire Bowern
arXiv.org Artificial Intelligence
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
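To make the tokenization contrast concrete, here is a minimal sketch of the two strategies for a CTC-style vocabulary. The digraph inventory and the example word are illustrative assumptions, not the paper's actual Yan-nhangu phoneme set or data:

```python
# Sketch: phonemic vs. orthographic tokenization for a CTC vocabulary.
# DIGRAPHS is a hypothetical inventory of multi-character phonemes; the
# paper's real Yan-nhangu phoneme set may differ.
DIGRAPHS = ["ng", "ny", "rr", "th"]

def tokenize_orthographic(word):
    """Baseline: one token per character of the orthography."""
    return list(word)

def tokenize_phonemic(word, digraphs=DIGRAPHS):
    """Greedy longest-match: map each phoneme, including digraphs, to one token."""
    tokens, i = [], 0
    while i < len(word):
        for d in digraphs:
            if word.startswith(d, i):
                tokens.append(d)
                i += len(d)
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

# Illustrative word only:
print(tokenize_orthographic("ngatha"))  # ['n', 'g', 'a', 't', 'h', 'a']
print(tokenize_phonemic("ngatha"))      # ['ng', 'a', 'th', 'a']
```

The phonemic scheme shrinks target sequences and removes the need for the model to learn that character pairs like "ng" form a single sound, which is one plausible reason such a vocabulary helps a low-data wav2vec2 fine-tune.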
Oct-9-2025