Characterizing and Efficiently Accelerating Multimodal Generation Model Inference