Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
