Collaborative Learning to Generate Audio-Video Jointly