M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Xue, Jinlong, Deng, Yayue, Wang, Fengping, Li, Ya, Gao, Yingming, Tao, Jianhua, Sun, Jianqing, Liang, Jiaen

May 3, 2023, arXiv.org Artificial Intelligence

Conversational text-to-speech (TTS) aims to synthesize a reply whose prosody is appropriate to the preceding conversation. Modeling the conversation comprehensively remains a challenge, however: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Moreover, textual features alone are insufficient, because acoustic features also carry a variety of prosodic information. We therefore propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the historical conversation and enhance prosodic expression. Specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results show that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody and naturalness in comparative mean opinion score (CMOS) tests.
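The abstract does not spell out how the context modules are built, so the following is only a minimal sketch of the general idea of combining coarse-grained and fine-grained context across two modalities. Everything in it is an assumption made for illustration and is not taken from the paper: mean pooling stands in for coarse-grained (utterance-level) modeling, scaled dot-product attention over history tokens stands in for fine-grained modeling, both scales are applied to each modality, and the function names and toy feature values are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed coarse-grained summary)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention of one query over token-level vectors
    (assumed fine-grained summary); keys double as values here."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def context_vector(query, history):
    """Concatenate coarse and fine context for one modality.

    history is a list of utterances; each utterance is a list of token vectors.
    """
    # Coarse: pool each utterance, then pool across utterances.
    coarse = mean_pool([mean_pool(tokens) for tokens in history])
    # Fine: let the current utterance's query attend to every history token.
    all_tokens = [tok for tokens in history for tok in tokens]
    fine = attend(query, all_tokens)
    return coarse + fine


# Hypothetical 2-dimensional features for two history utterances per modality.
text_history = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
audio_history = [[[0.5, 0.5]], [[0.0, 2.0], [2.0, 0.0]]]
query = [1.0, 0.0]  # stand-in for the current utterance representation

# Multi-modal context: textual and acoustic context vectors side by side.
multimodal = context_vector(query, text_history) + context_vector(query, audio_history)
print(len(multimodal))
```

In a real system the pooled and attended vectors would come from learned encoders and would condition the acoustic model; the sketch only shows how the two scales and the two modalities can be kept separate and then concatenated.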
