M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning