Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models
Qharabagh, Mahta Fetrat; Dehghanian, Zahra; Rabiee, Hamid R.
arXiv.org Artificial Intelligence
Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. The challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) certain disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by using it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift: using rich offline datasets to inform the development of fast, rule-based methods suited to latency-sensitive accessibility applications such as screen readers. To this end, we extend eSpeak, one of the most well-known rule-based G2P systems, into a fast homograph-aware version, HomoFast eSpeak. Our results show an improvement of approximately 30% in homograph disambiguation accuracy for both the deep learning-based system and eSpeak.
May-20-2025
- Country:
- Asia > Middle East
- Iran (0.04)
- Europe
- Portugal > Braga
- Braga (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America
- Mexico > Mexico City
- Mexico City (0.04)
- United States > Illinois (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Information Technology > Security & Privacy (0.93)
- Law (1.00)