Improved statistical benchmarking of digital pathology models using pairwise frames evaluation