Tell What You Hear From What You See - Video to Audio Generation Through Text

Neural Information Processing Systems 

When the audio caption is provided as a prompt, VATT achieves even more refined performance (with the lowest KLD score of 1.41).
