Precise Model Benchmarking with Only a Few Observations