Reviews: Image Captioning: Transforming Objects into Words

Neural Information Processing Systems 

The paper incorporates an object relation module into the Transformer model and demonstrates improvements with this approach. After reading the rebuttal, the reviewers agreed that this is an interesting direction to pursue. They liked the method and, in part, the results presented in the rebuttal. However, they remained concerned that additional evidence is necessary (e.g., proper evaluation on the test server, experiments with different spatial features, a more in-depth discussion of the attention visualizations, an empirical comparison to prior work, and a human evaluation).