How Benchmark Prediction from Fewer Data Misses the Mark