Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Massimo Daul, Alessio Tosolini, Claire Bowern
arXiv.org Artificial Intelligence
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
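To make the tokenization contrast concrete, here is a minimal sketch of the two strategies for a CTC-style vocabulary. The digraph inventory and the example word are illustrative assumptions, not the paper's actual Yan-nhangu phoneme set or data:

```python
# Sketch: phonemic vs. orthographic tokenization for a CTC vocabulary.
# DIGRAPHS is a hypothetical inventory of multi-character phonemes; the
# paper's real Yan-nhangu phoneme set may differ.
DIGRAPHS = ["ng", "ny", "rr", "th"]

def tokenize_orthographic(word):
    """Baseline: one token per character of the orthography."""
    return list(word)

def tokenize_phonemic(word, digraphs=DIGRAPHS):
    """Greedy longest-match: map each phoneme, including digraphs, to one token."""
    tokens, i = [], 0
    while i < len(word):
        for d in digraphs:
            if word.startswith(d, i):
                tokens.append(d)
                i += len(d)
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

# Illustrative word only:
print(tokenize_orthographic("ngatha"))  # ['n', 'g', 'a', 't', 'h', 'a']
print(tokenize_phonemic("ngatha"))      # ['ng', 'a', 'th', 'a']
```

The phonemic scheme shrinks target sequences and removes the need for the model to learn that character pairs like "ng" form a single sound, which is one plausible reason such a vocabulary helps a low-data wav2vec2 fine-tune.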
Oct-9-2025