When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

Nov-26-2025–arXiv.org Machine Learning

Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain scenarios. Single runs and unpaired t-tests often suggest significant gains for 0.6-2.0 point improvements, especially on text. With only three seeds, our paired protocol never declares significance in these settings. We argue that such conservative evaluation is a safer default for small gains under tight budgets.

paired bootstrap protocol, protocol, scenario, (13 more...)

arXiv.org Machine Learning

Nov-26-2025

arXiv.org PDF

Add feedback

Country:
- North America
  - United States > New York (0.04)
  - Canada > Ontario
    - Toronto (0.14)
- Asia > Thailand
  - Bangkok > Bangkok (0.04)

Genre:
- Research Report > Experimental Study (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.96)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found