The Limits of Assumption-free Tests for Algorithm Performance