AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
Li, Ruiqi, Huang, Rongjie, Zhang, Lichao, Liu, Jinglin, Zhao, Zhou
arXiv.org Artificial Intelligence
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings, but it faces a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment, which treats speech variance information, such as pitch and content, as different modalities. Inspired by how humans sing lyrics to a melody, AlignSTS: 1) adopts a novel rhythm adaptor that predicts the target rhythm representation to bridge the modality gap between content and pitch, where the rhythm representation is computed in a simple yet effective way and is quantized into a discrete space; and 2) uses the predicted rhythm representation to re-align the content through cross-attention and performs a cross-modal fusion for re-synthesis. Extensive experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://alignsts.github.io.
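The re-alignment step in point 2) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the predicted rhythm representation is a sequence of discrete code indices, looks them up in a hypothetical codebook, and uses the resulting embeddings as queries in plain scaled dot-product cross-attention over the source content frames. The names `realign_content`, `codebook`, and `rhythm_ids`, as well as all dimensions and values, are made up for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one output row per query row."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value rows, channel by channel.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def realign_content(rhythm_ids, codebook, content):
    """Hypothetical helper: rhythm codes act as queries over content frames."""
    queries = [codebook[i] for i in rhythm_ids]  # embed the discrete codes
    return cross_attention(queries, content, content)


random.seed(0)
DIM = 4
# Assumed toy data: 3 rhythm codes, 5 source (speech) frames, 8 target frames.
codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
content = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
rhythm_ids = [0, 0, 1, 2, 2, 2, 1, 0]

aligned, attn = realign_content(rhythm_ids, codebook, content)
print(len(aligned), len(aligned[0]))
```

The output has one row per target frame, so the content sequence is stretched from the source length to the target length. In this toy version, frames that share a rhythm code attend identically; a trained model would presumably condition its queries on more than the code alone.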
May-24-2023
- Country:
  - Asia (0.14)
- Genre:
  - Research Report > New Finding (0.68)
- Industry:
  - Leisure & Entertainment (0.46)
  - Media > Music (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.94)
    - Natural Language (1.00)
    - Representation & Reasoning (0.68)
    - Speech (1.00)