Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

Open in new window