Re-Imagining Multimodal Instruction Tuning: A Representation View