Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
