Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks