The Glass Ceiling of Automatic Evaluation in Natural Language Generation