Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

Arpa, Zaara Zabeen, Apurbo, Sadnam Sakib, Oishee, Nazia Karim Khan, Abrar, Ajwad

arXiv.org Artificial Intelligence 

Automatic Speech Recognition (ASR) is now integral to digital interaction, driving applications from virtual assistants to automated subtitles on platforms like YouTube ( Dykes et al. 2023). Despite its wide adoption and significant advancements, ASR performance remains imperfect, particularly in Large Vocabulary Continuous Speech Recognition (L VCSR) and under real-world conditions like background noise or diverse accents ( Errattahi et al. 2018; Romana et al. 2024). This often results in a significant Word Error Rate (WER). A major, persistent source of error stems from speech disfluencies, which are interruptions in the smooth flow of speech, including filled pauses, hesitations, self-corrections, and most relevantly, repetitions ( Romana et al. 2024). Disfluencies are natural and frequent; one study found a 50% probability of a disfluency in a 10-13 word sentence ( Jamshid Lou and Johnson 2020b). Their presence creates noisy transcripts that are difficult to read and detrimental to downstream Natural Language Processing (NLP) tasks such as machine translation or information extraction ( Romana et al. 2024; Jamshid Lou and Johnson 2020a).