Vision-Language Foundation Models as Effective Robot Imitators