PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
He, Jiajun, Sawada, Naoki, Miyazaki, Koichi, Toda, Tomoki
arXiv.org Artificial Intelligence
Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves a character error rate (CER) of 4.22% on Chinese AISHELL-1 and a word error rate (WER) of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets such as THCHS-30 and LibriSpeech.
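For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length, taken over characters for CER and over whitespace-separated words for WER. The function names and example strings are illustrative assumptions and are not taken from the paper; the authors' actual scoring setup (text normalization, tokenization) is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


# Hypothetical examples: one misrecognized entity token in each case.
print(wer("call doctor smith now", "call doctor smyth now"))  # 1 of 4 words wrong
print(cer("名古屋大学", "名古屋大雪"))  # 1 of 5 characters wrong
```

Because both rates are normalized by the reference length, a single misrecognized entity in a short utterance moves the score noticeably, which is one reason entity errors weigh on these metrics.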
Sep-5-2025
- Country:
- Asia > Japan > Honshū
- Chūbu > Aichi Prefecture
- Nagoya (0.40)
- Kantō > Tokyo Metropolis Prefecture
- Tokyo (0.14)
- Genre:
- Research Report (0.64)
- Industry:
- Education > Educational Setting > Higher Education (0.40)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Neural Networks > Deep Learning (0.47)
- Performance Analysis > Accuracy (0.49)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)