Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization