Testing the Consistency of Performance Scores Reported for Binary Classification Problems