Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection
Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli
arXiv.org Artificial Intelligence
Speech dysfluency modeling is the task of detecting dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advances treat this as a time-based object detection problem. In this work, we revisit the problem from a new perspective: we tokenize dysfluencies and cast detection as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators, use them to develop VCTK-token, and then build a Whisper-like seq2seq architecture that establishes a new benchmark with solid performance. We also systematically compare the proposed token-based methods with time-based methods and propose a unified benchmark to facilitate future research. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
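The token-based formulation can be illustrated with a minimal rule-based text dysfluency simulator: given a clean reference transcript, it injects dysfluency events and emits a token sequence in which each event is marked by a dedicated tag, which a Whisper-like seq2seq model could then be trained to output directly. The tag names, injection probability, and interface below are illustrative assumptions, not the authors' released code or token inventory.

```python
import random

# Illustrative dysfluency tags; the actual token inventory used for
# VCTK-token may differ (assumption, not the paper's released spec).
REP, BLOCK, INS, REPL, DEL = "<rep>", "<block>", "<ins>", "<repl>", "<del>"

FILLERS = ["uh", "um"]  # filler words used for simulated insertions


def simulate_text_dysfluency(words, p=0.2, seed=0):
    """Rule-based text dysfluency simulator (sketch).

    Randomly injects repetition, block, insertion, replacement, and
    deletion events into a clean word sequence and returns a token
    stream in which each event is marked with a dysfluency tag.
    """
    rng = random.Random(seed)
    tokens = []
    for w in words:
        if rng.random() >= p:          # keep the word fluent
            tokens.append(w)
            continue
        kind = rng.choice(["rep", "block", "ins", "repl", "del"])
        if kind == "rep":              # repetition: the word is said twice
            tokens += [w, REP, w]
        elif kind == "block":          # block: a blockage precedes the word
            tokens += [BLOCK, w]
        elif kind == "ins":            # insertion: a filler precedes the word
            tokens += [INS, rng.choice(FILLERS), w]
        elif kind == "repl":           # replacement: a wrong word, then the intended one
            tokens += [REPL, rng.choice(words), w]
        else:                          # deletion: the intended word is dropped
            tokens += [DEL]
    return tokens


if __name__ == "__main__":
    ref = "please call stella and ask her to bring these things".split()
    print(" ".join(simulate_text_dysfluency(ref, p=0.3, seed=1)))
```

In this framing, detection reduces to recognizing the tagged token sequence from the dysfluent audio, so standard seq2seq ASR training and decoding machinery applies without time-aligned bounding-box supervision.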
Sep-20-2024