Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants