Review for NeurIPS paper: Language and Visual Entity Relationship Graph for Agent Navigation

Neural Information Processing Systems 

Weaknesses:
- The proposed method is tailored to VLN, which may limit its generalization to other domains (it is not new for other vision-and-language tasks).
- If the same h_t and u are fed into all three attentions, how can different contexts be learned? Something seems to be wrong, either in the technique or in the notation.
- VLN models may be sensitive to hyper-parameter tuning. It would be better if the authors reported the mean and standard deviation over multiple runs.
- In what cases would the proposed model fail?
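
To make the request for the mean and standard deviation over multiple runs concrete, here is a minimal sketch of the kind of summary that would help. The function name `summarize_runs` and the success-rate values are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one metric over several runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical success rates (%) from five random seeds.
# These are placeholder values, NOT numbers reported in the paper.
hypothetical_success_rates = [55.0, 57.0, 56.0, 54.0, 58.0]

avg, std = summarize_runs(hypothetical_success_rates)
print(f"SR: {avg:.1f} +/- {std:.1f} over {len(hypothetical_success_rates)} runs")
```

Reporting each metric in this form, for both the proposed model and the baselines, would show whether the claimed gains exceed run-to-run variation.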