When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
