On the Bayes Inconsistency of Disagreement Discrepancy Surrogates
Marchant, Neil G., Cullen, Andrew C., Liu, Feng, Erfani, Sarah M.
Deep neural networks often fail when deployed in real-world contexts due to distribution shift, a critical barrier to building safe and reliable systems. An emerging approach to address this problem relies on \emph{disagreement discrepancy} -- a measure of how the disagreement between two models changes under a shifting distribution. The process of maximizing this measure has seen applications in bounding error under shifts, testing for harmful shifts, and training more robust models. However, this optimization involves the non-differentiable zero-one loss, necessitating the use of practical surrogate losses. We prove that existing surrogates for disagreement discrepancy are not Bayes consistent, revealing a fundamental flaw: maximizing these surrogates can fail to maximize the true disagreement discrepancy. To address this, we introduce new theoretical results providing both upper and lower bounds on the optimality gap for such surrogates. Guided by this theory, we propose a novel disagreement loss that, when paired with cross-entropy, yields a provably consistent surrogate for disagreement discrepancy. Empirical evaluations across diverse benchmarks demonstrate that our method provides more accurate and robust estimates of disagreement discrepancy than existing approaches, particularly under challenging adversarial conditions.
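The zero-one disagreement discrepancy described above can be sketched in a few lines of numpy; the function names and the toy threshold classifiers here are our illustration, not the paper's implementation.

```python
import numpy as np

def zero_one_disagreement(preds_a, preds_b):
    """Fraction of points on which two classifiers' hard labels differ (0-1 loss)."""
    return float(np.mean(preds_a != preds_b))

def disagreement_discrepancy(h, hp, x_src, x_tgt):
    """disc(h, h') = disagreement on the target domain minus disagreement on the source."""
    return (zero_one_disagreement(h(x_tgt), hp(x_tgt))
            - zero_one_disagreement(h(x_src), hp(x_src)))

# Toy 1-D example: h thresholds at 0, h' thresholds at 1.
h  = lambda x: (x > 0.0).astype(int)
hp = lambda x: (x > 1.0).astype(int)

x_src = np.array([-2.0, -1.0, 2.0, 3.0])   # source: h and h' agree on every point
x_tgt = np.array([0.5, 0.7, 0.2, 2.5])     # target: they disagree on 3 of 4 points

print(disagreement_discrepancy(h, hp, x_src, x_tgt))  # 0.75 - 0.0 = 0.75
```

Because the indicator `preds_a != preds_b` is non-differentiable, maximizing this quantity over a neural critic requires a surrogate loss, which is exactly where the consistency question above arises.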
Appendix A Experimental Details
This is referred to as DOC-Feat in [24]. COT uses the empirical estimator of the Earth Mover's Distance between labels from the source domain and softmax outputs of samples from the target domain.

A.2 Dataset Details

In this section, we provide additional details about the datasets used in our benchmark study. Overall, we obtain 5 datasets (i.e., CIFAR10v1, CIFAR100, …). Similar to CIFAR10, we use the original CIFAR100 set as the source dataset. Overall, we obtain 3 different domains.
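COT's empirical Earth Mover's Distance between source labels and target softmax outputs can be computed exactly as an assignment problem when the two samples have equal size. The sketch below assumes uniform sample weights and a total-variation ground cost between probability vectors; the paper's exact cost function may differ, and `cot_emd` is our name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cot_emd(src_labels, tgt_softmax, n_classes):
    """Empirical EMD between one-hot source labels and target softmax outputs,
    computed via an exact optimal assignment (equal sample sizes assumed)."""
    onehot = np.eye(n_classes)[src_labels]                          # (n, k)
    # Pairwise total-variation cost between probability vectors, in [0, 1].
    cost = 0.5 * np.abs(onehot[:, None, :] - tgt_softmax[None, :, :]).sum(-1)
    rows, cols = linear_sum_assignment(cost)                        # exact OT plan
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=8)
logits = rng.normal(size=(8, 3))
softmax = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print(cot_emd(labels, softmax, 3))
```

For large samples, an exact assignment is expensive; practical estimators typically use entropic regularization or mini-batch OT instead.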
ODD: Overlap-aware Estimation of Model Performance under Distribution Shift
Reliable and accurate estimation of the error of an ML model in unseen test domains is an important problem for safe intelligent systems. Prior work uses disagreement discrepancy (DIS^2) to derive practical error bounds under distribution shifts. It optimizes for a maximally disagreeing classifier on the target domain to bound the error of a given source classifier. Although this approach offers a reliable and competitively accurate estimate of the target error, we identify a problem in this approach that causes the disagreement discrepancy objective to compete in the overlapping region between the source and target domains. With the intuitive assumption that the target disagreement should be no more than the source disagreement in the overlapping region, due to sufficient support, we devise Overlap-aware Disagreement Discrepancy (ODD). Maximizing ODD only requires disagreement in the non-overlapping target region, removing the competition. Our ODD-based bound uses domain classifiers to estimate domain overlap and better predicts target performance than DIS^2. We conduct experiments on a wide array of benchmarks to show that our method improves the overall performance-estimation error while remaining valid and reliable. Our code and results are available on GitHub.
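The domain-classifier idea in ODD can be sketched as follows: train a source-vs-target classifier, treat target points it confidently flags as "target" as non-overlapping, and restrict the disagreement objective to that region. The function names, the hard probability threshold, and the toy data are our illustration of the idea, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def non_overlap_mask(x_src, x_tgt, threshold=0.5):
    """Train a source-vs-target domain classifier; target points it confidently
    assigns to the target domain are treated as non-overlapping."""
    X = np.vstack([x_src, x_tgt])
    d = np.concatenate([np.zeros(len(x_src)), np.ones(len(x_tgt))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_tgt = clf.predict_proba(x_tgt)[:, 1]   # P(domain = target | x)
    return p_tgt > threshold

def odd_disagreement(h, hp, x_tgt, mask):
    """Disagreement restricted to the (estimated) non-overlapping target region."""
    if not mask.any():
        return 0.0
    return float(np.mean(h(x_tgt[mask]) != hp(x_tgt[mask])))

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=(200, 2))
x_tgt = rng.normal(2.5, 1.0, size=(200, 2))   # shifted target
h  = lambda x: (x[:, 0] > 1.0).astype(int)
hp = lambda x: (x[:, 0] > 3.0).astype(int)
mask = non_overlap_mask(x_src, x_tgt)
print(odd_disagreement(h, hp, x_tgt, mask))
```

Restricting the maximization this way is what removes the competition in the overlap region: the critic is no longer rewarded for disagreeing on points the source classifier has support on.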
(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
We derive a new, (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods are either vacuous in practice or accurate on average but heavily underestimate error for a sizeable fraction of shifts. In particular, the latter only give guarantees based on complex continuous measures such as test calibration, which cannot be identified without labels, and are therefore unreliable. Instead, our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100\% of the time. The bound is inspired by the \mathcal{H}\Delta\mathcal{H}-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous test error upper bounds.
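As we understand the construction, the bound takes the following form: writing \epsilon_{\mathcal{S}} and \epsilon_{\mathcal{T}} for source and target error and \epsilon(h, h') for the disagreement between two classifiers, the target error of a fixed classifier h is bounded by its source error plus the maximal disagreement discrepancy over a critic class \mathcal{H}':

\epsilon_{\mathcal{T}}(h) \;\le\; \epsilon_{\mathcal{S}}(h) \;+\; \max_{h' \in \mathcal{H}'} \big[\, \epsilon_{\mathcal{T}}(h, h') - \epsilon_{\mathcal{S}}(h, h') \,\big]

The "simple, intuitive condition" mentioned in the abstract is what licenses replacing the unknown labeling function with the maximizing critic h'.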
Rosenfeld, Elan, Garg, Saurabh
When deploying a model, it is important to be confident in how it will perform under inevitable distribution shift. Standard methods for achieving this include data-dependent uniform convergence bounds (Ben-David et al., 2006, Mansour et al., 2009) (typically vacuous in practice) or assuming a precise model of how the distribution can shift (Chen et al., 2022, Rahimian and Mehrotra, 2019, Rosenfeld et al., 2021). Unfortunately, it is difficult or impossible to determine how severely these assumptions are violated by real data ("all models are wrong"), so practitioners usually cannot trust such bounds with confidence. To better estimate test performance in the wild, some recent work instead tries to directly predict the accuracy of neural networks using unlabeled data from the test distribution of interest (Baek et al., 2022, Garg et al., 2022, Lu et al., 2023). While these methods predict the test performance surprisingly well, they lack pointwise trustworthiness and verifiability: their estimates are good on average over all distribution shifts, but they provide no guarantee or signal of the quality of any individual prediction (here, each point is a single test distribution, for which a method predicts a classifier's average accuracy). Because of the opaque conditions under which these methods work, it is also difficult to anticipate their failure cases--indeed, it is reasonably common for them to substantially overestimate test accuracy for a particular shift, which is problematic when optimistic deployment can be costly or catastrophic.