TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
