Achieving Cross Modal Generalization with Multimodal Unified Representation
Hai Huang¹