Certifiably Adversarially Robust Detection of Out-of-Distribution Data

Neural Information Processing Systems 

Deep neural networks are known to be overconfident on out-of-distribution (OOD) inputs that clearly do not belong to any class. This is a problem in safety-critical applications, where a reliable assessment of a classifier's uncertainty is a key property that allows one to trigger human intervention or to transfer the system into a safe state. In this paper, we aim at certifiable worst-case guarantees for OOD detection by enforcing low confidence not only at an OOD point but in an entire $l_\infty$-ball around it. For this purpose, we use interval bound propagation (IBP) to upper-bound the maximal confidence in the $l_\infty$-ball and minimize this upper bound at training time. We show that non-trivial bounds on the confidence for OOD data are possible, and that these bounds generalize beyond the OOD datasets seen at training time.
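
As a concrete illustration of the certification step, the following is a minimal sketch in PyTorch, assuming the classifier is an `nn.Sequential` of `Linear` and `ReLU` layers; the helper names `ibp_bounds` and `confidence_upper_bound` are illustrative and not from the paper. IBP propagates an input box through the network, and the resulting logit bounds yield an upper bound on the softmax confidence over the whole $l_\infty$-ball, which can then be minimized on OOD data during training.

```python
# Sketch of interval bound propagation (IBP) and the induced upper bound
# on the maximal softmax confidence over an l_inf-ball. Layer handling is
# limited to Linear/ReLU for brevity; names here are illustrative.
import torch
import torch.nn as nn

def ibp_linear(layer: nn.Linear, lb: torch.Tensor, ub: torch.Tensor):
    """Propagate elementwise lower/upper bounds through a linear layer."""
    mid = (ub + lb) / 2                      # center of the input box
    rad = (ub - lb) / 2                      # radius of the input box
    mid_out = layer(mid)                     # W @ mid + b
    rad_out = rad @ layer.weight.abs().t()   # |W| scales the box radius
    return mid_out - rad_out, mid_out + rad_out

def ibp_bounds(model: nn.Sequential, x: torch.Tensor, eps: float):
    """Bounds on the logits valid for all inputs in the l_inf-ball of radius eps."""
    lb, ub = x - eps, x + eps
    for layer in model:
        if isinstance(layer, nn.Linear):
            lb, ub = ibp_linear(layer, lb, ub)
        elif isinstance(layer, nn.ReLU):
            lb, ub = layer(lb), layer(ub)    # ReLU is monotone
        else:
            raise NotImplementedError(type(layer))
    return lb, ub

def confidence_upper_bound(lb: torch.Tensor, ub: torch.Tensor):
    """Upper bound on max_k softmax confidence over the ball: for each class k,
    combine the upper logit bound for k with the lower bounds of all others
    (softmax is increasing in logit k, decreasing in the rest)."""
    K = lb.shape[-1]
    per_class = []
    for k in range(K):
        others = torch.cat([lb[..., :k], lb[..., k + 1:]], dim=-1)
        denom = ub[..., k].exp() + others.exp().sum(-1)
        per_class.append(ub[..., k].exp() / denom)
    return torch.stack(per_class, dim=-1).max(-1).values
```

Since this upper bound is differentiable in the network weights, a training loss can penalize it (e.g. its log) on OOD samples, enforcing certifiably low confidence in the whole ball rather than only at the sampled point.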