Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models
Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi
arXiv.org Artificial Intelligence
Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
Jun-3-2025
- Country:
  - Asia > South Korea (0.05)
  - Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Genre:
  - Research Report > New Finding (0.67)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning
      - Neural Networks (1.00)
      - Statistical Learning (0.69)
    - Natural Language (1.00)
    - Speech (1.00)