> Include discussion of [A,B,C,D] and empirical comparison to [A,B,C,21]

We will add an empirical comparison with [A,B,C] (the authors of [21] have not released their model, nor did they present

Neural Information Processing Systems 

We thank the reviewers for the time and attention invested in these reviews, and we address their remarks below. The MSCOCO test server results from our model are shown below. For a fairer comparison, we will do so in the camera-ready version of the paper. A comparison based on human judgments should give additional insight, and we will add it to the camera-ready version as well.