Disentangling Space and Time in Video with Hierarchical Variational Auto-encoders