Supplementary Materials - Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Neural Information Processing Systems 

CLEVRER includes four types of questions: descriptive (e.g., 'what color'), explanatory ('what's responsible for'), predictive ('what will happen next'), and counterfactual ('what if'). The first two types focus on video understanding and temporal reasoning, while the latter two require reasoning about physical dynamics and prediction. We therefore focus mainly on the predictive and counterfactual questions in this work. CLEVRER consists of 20,000 videos: 10,000 training videos, 5,000 validation videos, and 5,000 test videos.
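
As an illustration of the focus described above, the following minimal sketch selects the predictive and counterfactual questions from a list of question records. The record layout, the `question_type` field name, and the helper `select_dynamics_questions` are assumptions made for this example and are not taken from the paper or from the released CLEVRER files; the toy records are placeholders, not dataset entries.

```python
# Question types that involve physical dynamics, per the description above.
DYNAMICS_TYPES = frozenset({"predictive", "counterfactual"})


def select_dynamics_questions(questions):
    """Return only the records whose type is predictive or counterfactual.

    Assumes each record is a dict with a 'question_type' key (hypothetical
    layout chosen for this sketch).
    """
    return [q for q in questions if q["question_type"] in DYNAMICS_TYPES]


# Toy records for illustration only; the question strings echo the
# examples quoted in the text.
toy_questions = [
    {"question_type": "descriptive", "question": "what color"},
    {"question_type": "explanatory", "question": "what's responsible for"},
    {"question_type": "predictive", "question": "what will happen next"},
    {"question_type": "counterfactual", "question": "what if"},
]

selected = select_dynamics_questions(toy_questions)
selected_types = [q["question_type"] for q in selected]
```

Filtering by a set of type names keeps the selection explicit, so adding the descriptive or explanatory questions back is a one-line change.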