Toward Robust Multimodal Learning using Multimodal Foundational Models