MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping