Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy