Reviews: Image Captioning: Transforming Objects into Words

Neural Information Processing Systems 

Summary - The proposed approach to image captioning extends two prior works: the object-based Up-Down method of [2] and the Transformer of [22] (already used for image captioning in [21]). Specifically, the authors integrate spatial relations between objects into the captioning Transformer, yielding what they call the Object Relation Transformer. The modification amounts to introducing an object relation module [9] into the encoder layers of the Transformer. Tests of statistical significance show that the proposed model outperforms the standard Transformer in terms of CIDEr-D, BLEU-1 and ROUGE-L, while the SPICE attribute breakdown shows improvement in the Relation and Count categories. Qualitative results include examples where the Object Relation Transformer produces more accurate spatial Relation and Count predictions.
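To make the modification concrete, here is a minimal sketch of how a geometric relation term could bias self-attention over detected objects. It is an illustration under stated assumptions, not the authors' implementation: the box format (center x, center y, width, height), the four log-ratio geometry features, the single linear map `w_g` (used here in place of a learned high-dimensional embedding), and the function names are all assumptions modeled loosely on the relation-module idea in [9].

```python
import numpy as np

def box_geometry_features(boxes):
    """Pairwise relative geometry for boxes given as [x_center, y_center, width, height].

    Returns an (N, N, 4) array; entry [m, n] describes box n relative to box m.
    The feature choice is an assumption made for this sketch.
    """
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    eps = 1e-3  # avoids log(0) when two boxes share a center coordinate
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]), eps) / w[:, None])
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]), eps) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def geometry_biased_attention(features, boxes, w_q, w_k, w_v, w_g):
    """Single-head self-attention whose logits are shifted by a geometry term.

    features: (N, d) object appearance features; w_q, w_k, w_v: (d, d_k) projections;
    w_g: (4,) hypothetical linear map from geometry features to a scalar weight.
    """
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    appearance = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product logits, (N, N)
    # Clamp to a small positive value so the logarithm below stays finite.
    geometry = np.maximum(box_geometry_features(boxes) @ w_g, 1e-6)
    logits = appearance + np.log(geometry)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over objects
    return weights @ v, weights
```

In this sketch, object pairs whose geometry weight is clamped to the floor receive almost no attention, so the spatial layout modulates which objects inform each other's encoded representation.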