Unifying Human and Statistical Evaluation for Natural Language Generation