Reviews: Contrastive Learning for Image Captioning

Neural Information Processing Systems 

The paper proposes a contrastive learning approach for image captioning models. Typical image captioning models are trained with a log-likelihood criterion, which tends to favor safe, generic captions that lack the specific and distinctive concepts in an image. The paper introduces a contrastive learning objective, in which the objective function is based on the density ratio relative to the reference, without altering the captioning models themselves. The paper evaluates multiple models on the MSCOCO and InstaPIC datasets, demonstrates the effectiveness of the approach, and conducts ablation studies. The paper is well written and has the following strengths.