Accounting for Underspecification in Statistical Claims of Model Superiority