CLEVRER: CoLlision Events for Video REpresentation and Reasoning