Confidence and Stability of Global and Pairwise Scores in NLP Evaluation