Leveraging per Image-Token Consistency for Vision-Language Pre-training