Tell What You Hear From What You See - Video to Audio Generation Through Text
–Neural Information Processing Systems
When the audio caption is provided as a prompt, VATT achieves even more refined performance, with the lowest KLD score of 1.41.
Oct-10-2025, 14:21:14 GMT
- Country:
- Europe > Spain
- Catalonia > Barcelona Province > Barcelona (0.04)
- North America > United States
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Washington > King County
- Seattle (0.04)
- South America > Chile
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Leisure & Entertainment (1.00)
- Media > Music (0.93)
- Technology: