Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

Neural Information Processing Systems

Existing image-text modality alignment in Vision Language Models (VLMs) treats every text token equally in the autoregressive training objective.