Multimodal Pretrained Models for Sequential Decision-Making: Synthesis, Verification, Grounding, and Perception