Review for NeurIPS paper: Neural Execution Engines: Learning to Execute Subroutines


Weaknesses: In general, I think the technical novelty of this work is limited. In particular, the authors claim that an additional mask-prediction component is necessary to achieve generalization. My understanding is that the training supervision for the NEE includes the desired mask at each execution step, which corresponds to the data pointers. However, it is unclear whether the training supervision for the baseline Transformer also includes the ground-truth masks, or only the output value at each step. In short, I want to know whether the improvement comes from the finer-grained supervision or from the architectural design.
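To make the distinction concrete, the two supervision regimes I have in mind can be sketched as follows. This is a toy illustration of my question, not the paper's actual training code; all names, shapes, and loss choices are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single categorical prediction."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def binary_ce(p, y):
    """Binary cross-entropy for one mask bit with predicted probability p."""
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

# Hypothetical per-step model outputs and targets
value_logits = [0.1, 2.0, -1.0]  # predicted output value (3 classes, illustrative)
true_value = 1                   # ground-truth output value at this step
mask_probs = [0.9, 0.2]          # predicted data-pointer mask (2 slots, illustrative)
true_mask = [1, 0]               # ground-truth mask at this step

# Regime A: value-only supervision (what the baseline Transformer may receive)
loss_a = cross_entropy(value_logits, true_value)

# Regime B: value + mask supervision (what the NEE appears to receive)
loss_b = loss_a + sum(binary_ce(p, y) for p, y in zip(mask_probs, true_mask))
```

If the baseline was trained only under regime A, an ablation of the baseline under regime B would disentangle the contribution of the extra supervision from that of the architecture.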