Tell What You Hear From What You See -- Video to Audio Generation Through Text