When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards