Rethinking the Evaluation of Neural Machine Translation