Controllable Video Generation with Provable Disentanglement