Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art
Gao, Yuan; Srivastava, Brij Mohan Lal; Salsman, James
ABSTRACT
We use automatic speech recognition to assess the pronunciation of spoken English learners based on the authentic intelligibility of their spoken responses, as determined from support vector machine (SVM) classifier or deep learning neural network model predictions of transcription correctness. Using numeric features produced by PocketSphinx alignment mode and many recognition passes searching for the substitution and deletion of each expected phoneme and the insertion of unexpected phonemes in sequence, the SVM models achieve 82% agreement with the accuracy of Amazon Mechanical Turk crowdworker transcriptions, up from 75% reported by multiple independent researchers. Using such features with SVM classifier probability prediction models can help computer-aided pronunciation teaching (CAPT) systems provide intelligibility remediation.

Index Terms-- phoneme alignment, pronunciation assessment, computer-aided language learning, binary features

1. INTRODUCTION
Authentic intelligibility, the ability of listeners to correctly transcribe recorded utterances, was initially used for CAPT by [1] and [2]. It is a better measure of pronunciation for spoken language learners than mispronunciations identified by expert pronunciation judges or panels of experts, because such mispronunciations are associated with only 16% of intelligibility problems, according to [3], who state:

We investigated... which words are likely to be misrecognized and which words are likely to be marked as pronunciation errors. Words perceived as mispronounced remain intelligible in about half of all cases.
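
The agreement figures in the abstract (82% versus 75%) compare model predictions of transcription correctness with what crowdworkers actually transcribed. The text here does not spell out how agreement is computed, so the following is only a minimal sketch of one plausible reading: the fraction of items on which a binary model prediction matches the binary transcription outcome. The function name, variable names, and toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def agreement_rate(predicted: Sequence[bool], observed: Sequence[bool]) -> float:
    """Return the fraction of items where prediction and observation match.

    Assumption: each item is one word or utterance; `predicted` is the
    model's guess that it will be transcribed correctly, and `observed`
    is whether the transcribers actually got it right.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one item is required")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


# Hypothetical toy data, not results from the paper.
model_says_intelligible = [True, True, False, True, False]
transcribed_correctly = [True, False, False, True, True]

rate = agreement_rate(model_says_intelligible, transcribed_correctly)
print(f"agreement: {rate:.0%}")
```

A probability-producing classifier, such as the SVM probability prediction models mentioned in the abstract, would first need its outputs thresholded into binary predictions before a comparison of this kind.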
Jan-26-2018