VIMI: Grounding Video Generation through Multi-modal Instruction