prosodic feature
"Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue
Ngo, Anh, Rollet, Nicolas, Pelachaud, Catherine, Clavel, Chloe
Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve it), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues and exploring multilingual and cross-context corpora to assess robustness and generalizability.
Multi-task Pretraining for Enhancing Interpretable L2 Pronunciation Assessment
Li, Jiun-Ting, Yan, Bi-Cheng, Wang, Yi-Cheng, Chen, Berlin
Most existing efforts on automatic pronunciation assessment (APA) typically adopt segmental-level features as inputs and predict pronunciation scores at different granularities via hierarchical (or parallel) pronunciation modeling. This, however, inevitably causes assessments across linguistic levels (e.g., phone, word, and utterance) to rely solely on phoneme-level pronunciation features, nearly sidelining supra-segmental pronunciation cues. To address this limitation, we introduce multi-task pre-training (MTP) for APA, a simple yet effective strategy that attempts to capture long-term temporal pronunciation cues while strengthening the intrinsic structures within an utterance via the objective of reconstructing input features. Specifically, for a phoneme-level encoder of an APA model, the proposed MTP strategy randomly masks segmental-level pronunciation features and reconstructs the masked ones based on their surrounding pronunciation context. Furthermore, current APA systems lack integration with automated speaking assessment (ASA), limiting holistic proficiency evaluation. Drawing on empirical studies and prior knowledge in ASA, our framework bridges this gap by incorporating handcrafted features (HCFs), such as fluency (speech rate, silence duration) and stress (pitch accent strength), derived from human-designed formulas via regressors to generate interpretable proficiency scores. Experiments on speechocean762 show improved pronunciation scoring and ASA proficiency correlation, enabling targeted training and comprehensive proficiency assessment. Index Terms: computer-assisted language learning, automatic pronunciation assessment, automated speaking assessment, multi-task learning.
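The mask-and-reconstruct objective described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `mask_features` and `reconstruction_loss`, the 15% default masking ratio, zero-filling as the masking operation, and mean squared error as the loss. The abstract does not specify any of these, and the paper's actual encoder is not reproduced.

```python
import random


def mask_features(features, mask_ratio=0.15, seed=0):
    """Hide a random subset of phone-level feature vectors (hypothetical helper).

    features: list of per-phone feature vectors (lists of floats).
    Returns a masked copy and the sorted indices of the hidden positions.
    """
    rng = random.Random(seed)
    n = len(features)
    num_masked = max(1, round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), num_masked))
    # Assumption: masked vectors are replaced with zeros.
    masked = [
        [0.0] * len(row) if i in masked_idx else list(row)
        for i, row in enumerate(features)
    ]
    return masked, sorted(masked_idx)


def reconstruction_loss(predicted, target, masked_idx):
    """Mean squared error over the masked positions only (assumed loss)."""
    squared = [
        (p - t) ** 2
        for i in masked_idx
        for p, t in zip(predicted[i], target[i])
    ]
    return sum(squared) / len(squared)


# Usage: 20 phones with 4-dimensional features, a quarter of them masked.
features = [[1.0] * 4 for _ in range(20)]
masked, idx = mask_features(features, mask_ratio=0.25)
# A real encoder would predict the hidden vectors from context; here the
# masked input itself stands in for an untrained prediction.
loss = reconstruction_loss(masked, features, idx)
```

Restricting the loss to masked positions means the model is only rewarded for recovering hidden vectors from their surrounding context, which is the property the abstract attributes to the pre-training strategy.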
Improving French Synthetic Speech Quality via SSML Prosody Control
Ouali, Nassima Ould, Sani, Awais Hussain, Bueno, Ruben, Dauvet, Jonah, Horstmann, Tim Luka, Moulines, Eric
Despite recent advances, synthetic voices often lack expressiveness due to limited prosody control in commercial text-to-speech (TTS) systems. We introduce the first end-to-end pipeline that inserts Speech Synthesis Markup Language (SSML) tags into French text to control pitch, speaking rate, volume, and pause duration. We employ a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other performs regression on prosodic targets, generating commercial TTS-compatible SSML markup. Evaluated on a 14-hour French podcast corpus, our method achieves 99.2% F1 for break placement and reduces mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only large language models (LLMs) and a BiLSTM baseline. In perceptual evaluation involving 18 participants across over 9 hours of synthesized audio, SSML-enhanced speech generated by our pipeline significantly improves naturalness, with the mean opinion score increasing from 3.20 to 3.87 (p < 0.005). Additionally, 15 of 18 listeners preferred our enhanced synthesis. These results demonstrate substantial progress in bridging the expressiveness gap between synthetic and natural French speech. Our code is publicly available at https://github.com/hi-paris/Prosody-Control-French-TTS.
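To make the target output format concrete, the following sketch assembles SSML from per-phrase prosody values. It is not the authors' pipeline (their code is at the repository linked above). The function name `to_ssml`, the dictionary keys, and the example values are hypothetical. Only the `<speak>`, `<prosody>` (with `pitch`, `rate`, `volume`), and `<break>` (with `time`) elements are taken from the SSML standard.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ssml(phrases):
    """Build an SSML string from phrase dicts (hypothetical input format).

    Each dict carries: text, pitch, rate, volume, and break_ms (pause after
    the phrase in milliseconds; 0 means no pause).
    """
    parts = ["<speak>"]
    for p in phrases:
        parts.append(
            f'<prosody pitch="{p["pitch"]}" rate="{p["rate"]}" '
            f'volume="{p["volume"]}">{escape(p["text"])}</prosody>'
        )
        if p.get("break_ms"):
            parts.append(f'<break time="{p["break_ms"]}ms"/>')
    parts.append("</speak>")
    ssml = "".join(parts)
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


# Usage with made-up prosody values for two French phrases.
ssml = to_ssml([
    {"text": "Bonjour & bienvenue", "pitch": "+5%", "rate": "95%",
     "volume": "+2dB", "break_ms": 300},
    {"text": "À bientôt", "pitch": "-2%", "rate": "100%",
     "volume": "+0dB", "break_ms": 0},
])
```

In a cascaded setup like the one the abstract describes, the phrase boundaries and the numeric targets would come from the two fine-tuned models. Here they are hard-coded.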
ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
Eren, Eray, Liu, Qingju, Kim, Hyeongwoo, Garrido, Pablo, Alwan, Abeer
Prosody conveys rich emotional and semantic information of the speech signal as well as individual idiosyncrasies. We propose a stand-alone model that maps text to prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both partially masked, and obtains a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and unmasked textual content. We train the model on the GigaSpeech dataset and compare it with state-of-the-art style encoders. For F0 and energy predictions, we show consistent improvements for our model at different levels of granularity. We also integrate these predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model's potential in tasks where prosody modeling is important.
The Prosody of Emojis
Zhou, Giulio, Lam, Tsz Kin, Birch, Alexandra, Haddow, Barry
Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emoji by analysing actual human speech data, collected through structured but open-ended production and perception tasks. This provides empirical evidence of how emoji semantics shape spoken delivery and perception. Results show that speakers adapt their prosody based on emoji cues, listeners can often identify the intended emoji from prosodic variation alone, and greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis can act as meaningful carriers of prosodic intent, offering insight into their communicative role in digitally mediated contexts.
Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody
Sasu, David, Yamoah, Kweku Andoh, Quartey, Benedict, Schluter, Natalie
Enabling robots to accurately interpret and execute spoken language instructions is essential for effective human-robot collaboration. Traditional methods rely on speech recognition to transcribe speech into text, often discarding crucial prosodic cues needed for disambiguating intent. We propose a novel approach that directly leverages speech prosody to infer and resolve instruction intent. Predicted intents are integrated into large language models via in-context learning to disambiguate and select appropriate task plans. Additionally, we present the first ambiguous speech dataset for robotics, designed to advance research in speech disambiguation. Our method achieves 95.79% accuracy in detecting referent intents within an utterance and determines the intended task plan of ambiguous instructions with 71.96% accuracy, demonstrating its potential to significantly improve human-robot communication.
Using Information Theory to Characterize Prosodic Typology: The Case of Tone, Pitch-Accent and Stress-Accent
Wilcox, Ethan Gotlieb, Ding, Cui, Acampa, Giovanni, Pimentel, Tiago, Warstadt, Alex, Regev, Tamar I.
This paper argues that the relationship between lexical identity and prosody -- one well-studied parameter of linguistic variation -- can be characterized using information theory. We predict that languages that use prosody to make lexical distinctions should exhibit a higher mutual information between word identity and prosody, compared to languages that don't. We test this hypothesis in the domain of pitch, which is used to make lexical distinctions in tonal languages, like Cantonese. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and their pitch curves. We find that, across languages, pitch curves display similar amounts of entropy. However, these curves are easier to predict given their associated text in the tonal languages, compared to pitch- and stress-accent languages, and thus the mutual information is higher in these languages, supporting our hypothesis. Our results support perspectives that view linguistic typology as gradient, rather than categorical.
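The quantity at the centre of this abstract is the mutual information between text and pitch, which can be written as MI = H(pitch) - H(pitch | text). The toy sketch below computes it for discrete categories only. The paper works with continuous pitch curves, and its estimation method is not given in the abstract, so the discretisation, the function names, and the made-up data are all assumptions for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as category counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def mutual_information(pairs):
    """MI between word identity and a discrete pitch category.

    pairs: list of (word, pitch_category) observations.
    Computes H(pitch) - H(pitch | word).
    """
    n = len(pairs)
    h_pitch = entropy(Counter(pitch for _, pitch in pairs))
    by_word = {}
    for word, pitch in pairs:
        by_word.setdefault(word, Counter())[pitch] += 1
    h_pitch_given_word = sum(
        (sum(c.values()) / n) * entropy(c) for c in by_word.values()
    )
    return h_pitch - h_pitch_given_word


# Made-up data: in the "tonal" toy language the word fixes the pitch
# category; in the "non-tonal" one, pitch is independent of the word.
tonal = [("ma1", "high"), ("ma3", "low")] * 50
non_tonal = [("a", "high"), ("a", "low"), ("b", "high"), ("b", "low")] * 25
mi_tonal = mutual_information(tonal)
mi_non_tonal = mutual_information(non_tonal)
```

Both toy languages have the same pitch entropy (one bit), but only in the first does knowing the word remove that uncertainty. This mirrors the abstract's finding that entropy is similar across languages while predictability from text differs.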
WHISTRESS: Enriching Transcriptions with Sentence Stress Detection
Yosha, Iddo, Shteyman, Dorin, Adi, Yossi
Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training dataset for sentence stress detection, produced by a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks. Project page: https://pages.cs.huji.ac.il/adiyoss-lab/whistress.
The time scale of redundancy between prosody and linguistic context
Regev, Tamar I., Ohams, Chiebuka, Xie, Shaylee, Wolf, Lukas, Fedorenko, Evelina, Warstadt, Alex, Wilcox, Ethan G., Pimentel, Tiago
In spoken language, speakers transmit information not only using words, but also via a rich array of non-verbal signals, which include prosody -- the auditory features of speech. However, previous studies have shown that prosodic features exhibit significant redundancy with both past and future words. Here, we examine the time scale of this relationship: How many words in the past (or future) contribute to predicting prosody? We find that this scale differs for past and future words. Prosody's redundancy with past words extends across approximately 3-8 words, whereas redundancy with future words is limited to just 1-2 words. These findings indicate that the prosody-future relationship reflects local word dependencies or short-scale processes such as next word prediction, while the prosody-past relationship unfolds over a longer time scale. The latter suggests that prosody serves to emphasize earlier information that may be challenging for listeners to process given limited cognitive resources in real-time communication. Our results highlight the role of prosody in shaping efficient communication.
The Role of Prosody in Spoken Question Answering
Chi, Jie, de Seyssel, Maureen, Schluter, Natalie
Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then subsequently synthesized into speech, and most models typically rely on automatic transcriptions of speech. This is to the detriment of prosody--additional information carried by the speech signal beyond the phonetics of the words themselves and difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of natural speech, we demonstrate that models trained on prosodic information alone can perform reasonably well by utilizing prosodic cues. However, we find that when lexical information is available, models tend to predominantly rely on it. Our findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are required to ensure prosody contributes more significantly alongside lexical features.