Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks