Light Future: Multimodal Action Frame Prediction via InstructPix2Pix