Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies
