Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers