VIP5: Towards Multimodal Foundation Models for Recommendation