Prediction with Action: Visual Policy Learning via Joint Denoising Process