AITopics | prosody

Collaborating Authors

prosody

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

8a9c8ac001d3ef9e4ce39b1177295e03-Paper.pdf

Neural Information Processing Systems, Feb-19-2026, 05:47:34 GMT

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production.

artificial intelligence, machine learning, speech, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Neural Dubber: Dubbing for Videos According to Scripts

Neural Information Processing Systems, Feb-9-2026, 18:27:49 GMT

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

artificial intelligence, machine learning, neural dubber, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Neural Information Processing Systems, Dec-24-2025, 07:34:19 GMT

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing the long-range semantics features (e.g., prosody) even with small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture.

name change, portable and high-quality generative text-to-speech, portaspeech, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (1.00)

Learning When to Ask: Simulation-Trained Humanoids for Mental-Health Diagnosis

Cenacchi, Filippo, Richards, Deborah, Cao, Longbing

arXiv.org Artificial Intelligence, Dec-11-2025

Testing humanoid robots with users is slow, causes wear, and limits iteration and diversity. Yet screening agents must master conversational timing, prosody, backchannels, and what to attend to in faces and speech for Depression and PTSD. Most simulators omit policy learning with nonverbal dynamics; many controllers chase task accuracy while underweighting trust, pacing, and rapport. We virtualise the humanoid as a conversational agent to train without hardware burden. Our agent-centred, simulation-first pipeline turns interview data into 276 Unreal Engine MetaHuman patients with synchronised speech, gaze/face, and head-torso poses, plus PHQ-8 and PCL-C flows. A perception-fusion-policy loop decides what and when to speak, when to backchannel, and how to avoid interruptions, under a safety shield. Training uses counterfactual replay (bounded nonverbal perturbations) and an uncertainty-aware turn manager that probes to reduce diagnostic ambiguity. Results are simulation-only; the humanoid is the transfer target. In comparing three controllers, a custom TD3 (Twin Delayed DDPG) outperformed PPO and CEM, achieving near-ceiling coverage with steadier pace at comparable rewards. Decision-quality analyses show negligible turn overlap, aligned cut timing, fewer clarification prompts, and shorter waits. Performance stays stable under modality dropout and a renderer swap, and rankings hold on a held-out patient split. Contributions: (1) an agent-centred simulator that turns interviews into 276 interactive patients with bounded nonverbal counterfactuals; (2) a safe learning loop that treats timing and rapport as first-class control variables; (3) a comparative study (TD3 vs PPO/CEM) with clear gains in completeness and social timing; and (4) ablations and robustness analyses explaining the gains and enabling clinician-supervised humanoid pilots.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2512.08952

Country:

Europe > Middle East > Cyprus (0.16)
Oceania > Australia (0.14)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

Donepudi, Dharma Teja

arXiv.org Artificial Intelligence, Oct-30-2025

Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segments input text by Unicode script, applies adaptive language identification to determine each segment's language and locale, and normalizes prosody using sentiment-aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate or spans and synthesizes the utterance in a single TTS request. Unlike end-to-end multilingual models, SFMS-ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR's flexibility, interpretability, and immediate deployability. The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.

artificial intelligence, natural language, speech synthesis, (13 more...)

arXiv.org Artificial Intelligence

2510.25178

Genre: Research Report (0.51)

Industry: Information Technology > Services (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.88)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.74)
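
The script-first step named in the SFMS-ALR abstract (segment by Unicode script, then wrap each segment in an SSML span) can be sketched roughly as follows. This is an illustration only, not the paper's algorithm: the helper names, the fixed SCRIPT_TO_LOCALE table, and the choice of the SSML `<lang>` element are all assumptions, and the table stands in for the adaptive language identification the abstract describes.

```python
import unicodedata
from xml.sax.saxutils import escape

# Hypothetical script-to-locale table; the abstract describes adaptive
# language identification, which this fixed lookup does not attempt.
SCRIPT_TO_LOCALE = {"LATIN": "en-US", "CJK": "zh-CN", "DEVANAGARI": "hi-IN"}


def script_of(ch):
    """Return a rough script label taken from the Unicode character name."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation, combining marks: no script
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def segment_by_script(text):
    """Split text into (script, substring) runs; neutral characters join the current run."""
    runs = []     # each item is [script, substring]
    pending = ""  # neutral characters seen before the first scripted character
    for ch in text:
        script = script_of(ch)
        if script is None:
            if runs:
                runs[-1][1] += ch
            else:
                pending += ch
        elif runs and runs[-1][0] == script:
            runs[-1][1] += ch
        else:
            runs.append([script, pending + ch])
            pending = ""
    if pending:  # the input contained no scripted characters at all
        runs.append(["UNKNOWN", pending])
    return [(script, chunk) for script, chunk in runs]


def to_ssml(text, default_locale="en-US"):
    """Wrap each script run in an SSML <lang> span (assumed element choice)."""
    parts = []
    for script, chunk in segment_by_script(text):
        locale = SCRIPT_TO_LOCALE.get(script, default_locale)
        parts.append('<lang xml:lang="{}">{}</lang>'.format(locale, escape(chunk)))
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("Hello 你好 world"))
```

Script alone cannot separate languages that share a script (English and Spanish are both Latin), which is presumably why the abstract adds a language-identification step after segmentation.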

Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment

Lin, Zhiyu, Yang, Jingwen, Zhao, Jiale, Liu, Meng, Li, Sunzhu, Wang, Benyou

arXiv.org Artificial Intelligence, Oct-24-2025

Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.20513

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures

Ma, Xinyue, Pastells, Pol, Farrús, Mireia, Taulé, Mariona

arXiv.org Artificial Intelligence, Oct-17-2025

Semantic prosody is a collocational meaning formed through the co-occurrence of a linguistic unit and a consistent series of collocates, which should be treated separately from semantic meaning. Since words that are literal translations of each other may have different semantic prosody, more attention should be paid to this linguistic property to generate accurate translations. However, current machine translation models cannot handle this problem. To bridge the gap, we propose an approach to teach machine translation models about the semantic prosody of a specific structure. We focus on Chinese BEI passives and create a dataset of English-Chinese sentence pairs with the purpose of demonstrating the negative semantic prosody of BEI passives. Then we fine-tune OPUS-MT, NLLB-600M and mBART50 models with our dataset for the English-Chinese translation task. Our results show that fine-tuned MT models perform better at using BEI passives for translating unfavourable content and avoid using them for neutral and favourable content. Also, in NLLB-600M, which is a multilingual model, this knowledge of semantic prosody can be transferred from English-Chinese translation to other language pairs, such as Spanish-Chinese.

artificial intelligence, machine translation, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.14662

Country:

Europe (1.00)
Asia > China (0.28)

Genre: Research Report > New Finding (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning

Zhao, Junchuan, Wang, Xintong, Wang, Ye

arXiv.org Artificial Intelligence, Sep-30-2025

Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALL-E X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates and refines prosody from other sources, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental evaluation results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.

large language model, machine learning, voice conversion, (19 more...)

arXiv.org Artificial Intelligence

2505.15402

Country: Asia > Singapore (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.64)

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

Wang, Ke, Wei, Wenning, Deng, Yan, He, Lei, Zhao, Sheng

arXiv.org Artificial Intelligence, Sep-22-2025

Automatic Pronunciation Assessment (APA) is critical for Computer-Assisted Language Learning (CALL), requiring evaluation across multiple granularities and aspects. Large Multimodal Models (LMMs) present new opportunities for APA, but their effectiveness in fine-grained assessment remains uncertain. This work investigates fine-tuning LMMs for APA using the Speechocean762 dataset and a private corpus. Fine-tuning significantly outperforms zero-shot settings and achieves competitive results on single-granularity tasks compared to public and commercial systems. The model performs well at word and sentence levels, while phoneme-level assessment remains challenging. We also observe that the Pearson Correlation Coefficient (PCC) reaches 0.9, whereas Spearman's rank Correlation Coefficient (SCC) remains around 0.6, suggesting that SCC better reflects ordinal consistency. These findings highlight both the promise and limitations of LMMs for APA and point to future work on fine-grained modeling and rank-aware evaluation.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.15701

Genre: Research Report > New Finding (0.47)

Industry: Education > Curriculum > Subject-Specific Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
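
The gap between PCC and SCC reported in the pronunciation-assessment abstract above is easy to reproduce on toy data. The sketch below is a plain-Python illustration with made-up scores, not the paper's evaluation code; the function names are hypothetical. It computes Pearson correlation on raw scores and Spearman correlation as Pearson on average ranks, so a single extreme score can keep PCC high while swapped neighbours pull SCC down.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Made-up scores: neighbouring items are swapped, one item is extreme.
    human = [1, 2, 3, 4, 10]
    model = [2, 1, 4, 3, 10]
    print("PCC:", round(pearson(human, model), 2))   # 0.96
    print("SCC:", round(spearman(human, model), 2))  # 0.8
```

Here the outlier at 10 dominates the Pearson covariance, while the rank-based measure still registers the two swapped pairs.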
