YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection

Zhou, Xuanru, Kashyap, Anshul, Li, Steve, Sharma, Ayati, Morin, Brittany, Baquirin, David, Vonk, Jet, Ezzes, Zoe, Miller, Zachary, Tempini, Maria Luisa Gorno, Lian, Jiachen, Anumanchipalli, Gopala Krishna

Sep-15-2024–arXiv.org Artificial Intelligence

Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems, which lack efficiency and robustness and are sensitive to template design. In this paper, we propose YOLO-Stutter, a first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor, to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimum number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter
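
To make "region-wise boundary and class predictions" concrete, the following is a minimal sketch of how such predictions might be decoded into labeled time spans. It is not the authors' implementation: the prediction layout (normalized start, normalized end, confidence, then one probability per class), the class ordering, the `decode_regions` helper, and the confidence threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Dysfluency types named in the abstract; the ordering is an assumption.
CLASSES = ["repetition", "block", "missing", "replacement", "prolongation"]


@dataclass
class Region:
    start_s: float  # region start, in seconds
    end_s: float    # region end, in seconds
    label: str      # predicted dysfluency type
    score: float    # confidence of the prediction


def decode_regions(
    preds: Sequence[Sequence[float]],
    duration_s: float,
    threshold: float = 0.5,
) -> List[Region]:
    """Turn raw per-region predictions into labeled time spans.

    Hypothetical layout of each prediction:
    (start, end, confidence, p_0, ..., p_4), with start and end
    normalized to [0, 1] relative to the utterance duration.
    """
    regions = []
    for p in preds:
        start, end, conf = p[0], p[1], p[2]
        class_probs = p[3:3 + len(CLASSES)]
        # Drop low-confidence or degenerate (empty) regions.
        if conf < threshold or end <= start:
            continue
        best = max(range(len(CLASSES)), key=lambda i: class_probs[i])
        regions.append(
            Region(start * duration_s, end * duration_s, CLASSES[best], conf)
        )
    # Report regions in temporal order.
    return sorted(regions, key=lambda r: r.start_s)
```

Under these assumptions, calling `decode_regions(preds, duration_s=4.0)` on a four-second utterance yields a time-ordered list of `Region` objects, each carrying boundaries in seconds and one of the five dysfluency labels.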

artificial intelligence, dysfluency, speech recognition

Country:
- Asia > China (0.04)
- North America
  - United States > California
    - San Francisco County > San Francisco (0.14)
    - Alameda County > Berkeley (0.04)
  - Canada > Quebec
    - Montreal (0.04)

Genre:
- Research Report (0.84)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Rule-Based Reasoning (1.00)
  - Speech > Speech Recognition (0.70)
