Review for NeurIPS paper: Audeo: Audio Generation for a Silent Performance Video

Neural Information Processing Systems 

Summary and Contributions: This paper proposes a novel pipeline approach for improving piano music/audio generation from silent videos that show a top view of a pianist's fingers playing on a keyboard. Prior work [27] used an end-to-end approach, predicting a symbolic piano performance directly from video with ResNets. This paper points out that there is substantial mismatch between the video and music/audio streams, and hence that the processing requires multiple stages of transformation. The proposed pipeline consists of three interpretable components / stages. Video2Roll consists of three stages.