CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models
Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley
–arXiv.org Artificial Intelligence
Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction for approaching text-to-audio synthesis that leverages the naturally occurring audio-visual correspondence in videos and the power of pretrained language-vision models.
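The abstract describes a conditional diffusion model over audio that is trained with CLIP image embeddings of video frames and queried at test time with CLIP text embeddings, either zero-shot or after a pretrained diffusion prior maps the text embedding into an image-like embedding. The sketch below illustrates only that training/inference split; it is not the authors' implementation, and the embedding dimension, mel-spectrogram representation, diffusion schedule, and all module names are illustrative assumptions.

```python
# Minimal sketch (assumed details, not the CLIPSonic code): a denoiser that
# predicts noise on a mel-spectrogram, conditioned on a CLIP embedding.
import torch
import torch.nn as nn

EMB_DIM = 512   # CLIP embedding size (assumed)
MEL_DIM = 64    # mel-spectrogram bins (assumed)
T_STEPS = 1000  # diffusion steps (assumed)

class ConditionalDenoiser(nn.Module):
    """Toy stand-in for the conditional diffusion model."""
    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(EMB_DIM, 128)
        self.time_emb = nn.Embedding(T_STEPS, 128)
        self.net = nn.Sequential(
            nn.Linear(MEL_DIM + 128, 256), nn.SiLU(),
            nn.Linear(256, MEL_DIM),
        )

    def forward(self, noisy_mel, t, clip_emb):
        # Condition on the CLIP embedding plus a learned timestep embedding.
        cond = self.cond_proj(clip_emb) + self.time_emb(t)
        return self.net(torch.cat([noisy_mel, cond], dim=-1))

# Standard DDPM-style noise schedule (assumed values).
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, mel, frame_clip_emb):
    """Denoising objective: during training, the condition is the CLIP *image*
    embedding of a frame taken from the same video as the audio clip."""
    t = torch.randint(0, T_STEPS, (mel.size(0),))
    noise = torch.randn_like(mel)
    a = alphas_bar[t].unsqueeze(-1)
    noisy_mel = a.sqrt() * mel + (1 - a).sqrt() * noise
    pred = model(noisy_mel, t, frame_clip_emb)
    return nn.functional.mse_loss(pred, noise)

# Example usage with random tensors standing in for real features:
#   model = ConditionalDenoiser()
#   loss = training_step(model, torch.randn(8, MEL_DIM), torch.randn(8, EMB_DIM))
#
# At test time, one would either (a) pass the CLIP text embedding directly
# (zero-shot modality transfer), or (b) first map it to an image-like
# embedding with a pretrained diffusion prior to reduce the modality gap.
```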
Jul-23-2023
- Country:
- Asia > Taiwan (0.14)
- North America > United States
- California (0.14)
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Leisure & Entertainment (0.34)
- Media > Music (0.34)