Finding Replicable Human Evaluations via Stable Ranking Probability