This&That: Language-Gesture Controlled Video Generation for Robot Planning