TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation