MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Neural Information Processing Systems 

However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities.