Multimodal foundation world models for generalist embodied agents