Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing