Leveraging per Image-Token Consistency for Vision-Language Pre-training

Open in new window