Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection
Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli
arXiv.org Artificial Intelligence
Speech dysfluency modeling is the task of detecting dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advances treat this as a time-based object detection problem. In this work, we revisit the problem from a new perspective: we tokenize dysfluencies and cast detection as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators, use them to develop VCTK-token, and then build a Whisper-like seq2seq architecture that establishes a new benchmark with solid performance. We also systematically compare the proposed token-based methods with time-based methods and propose a unified benchmark to facilitate future research. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
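The token-based formulation can be illustrated with a minimal rule-based text dysfluency simulator: given a clean reference transcript, it injects dysfluency events and emits a token sequence in which each event is marked by a dedicated tag, which a Whisper-like seq2seq model could then be trained to output directly. The tag names, injection probability, and interface below are illustrative assumptions, not the authors' released code or token inventory.

```python
import random

# Illustrative dysfluency tags; the actual token inventory used for
# VCTK-token may differ (assumption, not the paper's released spec).
REP, BLOCK, INS, REPL, DEL = "<rep>", "<block>", "<ins>", "<repl>", "<del>"

FILLERS = ["uh", "um"]  # filler words used for simulated insertions


def simulate_text_dysfluency(words, p=0.2, seed=0):
    """Rule-based text dysfluency simulator (sketch).

    Randomly injects repetition, block, insertion, replacement, and
    deletion events into a clean word sequence and returns a token
    stream in which each event is marked with a dysfluency tag.
    """
    rng = random.Random(seed)
    tokens = []
    for w in words:
        if rng.random() >= p:          # keep the word fluent
            tokens.append(w)
            continue
        kind = rng.choice(["rep", "block", "ins", "repl", "del"])
        if kind == "rep":              # repetition: the word is said twice
            tokens += [w, REP, w]
        elif kind == "block":          # block: a blockage precedes the word
            tokens += [BLOCK, w]
        elif kind == "ins":            # insertion: a filler precedes the word
            tokens += [INS, rng.choice(FILLERS), w]
        elif kind == "repl":           # replacement: a wrong word, then the intended one
            tokens += [REPL, rng.choice(words), w]
        else:                          # deletion: the intended word is dropped
            tokens += [DEL]
    return tokens


if __name__ == "__main__":
    ref = "please call stella and ask her to bring these things".split()
    print(" ".join(simulate_text_dysfluency(ref, p=0.3, seed=1)))
```

In this framing, detection reduces to recognizing the tagged token sequence from the dysfluent audio, so standard seq2seq ASR training and decoding machinery applies without time-aligned bounding-box supervision.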
Sep-20-2024