When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements