Robot Shape and Location Retention in Video Generation Using Diffusion Models