Review for NeurIPS paper: Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge

Neural Information Processing Systems 

Weaknesses: Overall, I really liked the experiments in the paper, but some of the analysis could be made more complete. In particular, I had the following questions about the experiments: 1. In the experiments of Section 4.1, a possible explanation for the model's performance could be that it's not due to implicit reasoning but due to the distractor subject leaking the answer: For example, given A mammal has a belly button and A whale eats fish, the model can infer that a whale has a belly button simply by combining these facts instead of doing implicit reasoning. One possible explanation for why the hypothesis-only results are poor is because the hypothesis-only conditions are only seen in 20% of the examples. What happens if the model is re-trained with only hypothesis information and labels, and then evaluated in the hypothesis only mode? 4. To isolate the effect of pre-trained representations, why do the authors choose to use a different architecture (ESIM) instead of using RoBERTa with randomly initialized weights? 5.