Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks