How Reliable is Language Model Micro-Benchmarking?