What is the state of the art? Accounting for multiplicity in machine learning benchmark performance