Computational Learning Theory


Towards a Unified Information-Theoretic Framework for Generalization

Neural Information Processing Systems

In this work, we investigate the expressiveness of the "conditional mutual information" (CMI) framework of Steinke and Zakynthinou [1] and the prospect of using it to provide a unified framework for proving generalization bounds in the realizable setting. We first demonstrate that one can use this framework to express non-trivial (but sub-optimal) bounds for any learning algorithm that outputs hypotheses from a class of bounded VC dimension. We then explore two directions of strengthening this bound: (i) Can the CMI framework express optimal bounds for VC classes?
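For reference, here is a minimal statement of the CMI quantity and the generalization bound it yields, following the setup of Steinke and Zakynthinou [1]; the supersample \tilde{Z} and selector U are the standard notation of that framework, and the constant is as usually stated for losses bounded in [0, 1]:

```latex
% CMI of algorithm A on distribution D, following Steinke-Zakynthinou [1]:
% \tilde{Z} is a supersample of n pairs of i.i.d. points from D, and
% U ~ Unif({0,1}^n) selects one point from each pair to form the sample S.
\mathrm{CMI}_{\mathcal{D}}(A) \;=\; I\big(A(S);\, U \,\big|\, \tilde{Z}\big)

% For a loss bounded in [0,1], the expected generalization gap obeys
\Big|\, \mathbb{E}\big[ L_{\mathcal{D}}(A(S)) - L_{S}(A(S)) \big] \,\Big|
\;\le\; \sqrt{\frac{2\,\mathrm{CMI}_{\mathcal{D}}(A)}{n}} .
```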


Information Theoretic Lower Bounds for Information Theoretic Upper Bounds

Neural Information Processing Systems

In the context of stochastic convex optimization, we examine the relationship between the generalization of an algorithm and the mutual information between its output model and the empirical sample. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our study of stochastic convex optimization reveals that, for true risk minimization, dimension-dependent mutual information is necessary. This indicates that existing information-theoretic generalization bounds fall short in capturing the generalization capabilities of algorithms like SGD and regularized ERM, which have dimension-independent sample complexity.
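The mutual information bound at issue is, in its standard form due to Xu and Raginsky (2017), the following; the σ-subgaussian loss assumption is the usual one, and the notation below is generic rather than taken from this paper:

```latex
% Input-output mutual information bound (Xu & Raginsky, 2017):
% W = A(S) is the output model, S = (Z_1, ..., Z_n) the empirical sample,
% and the loss \ell(w, Z) is \sigma-subgaussian for every fixed w.
\Big|\, \mathbb{E}\big[ L_{\mathcal{D}}(W) - L_{S}(W) \big] \,\Big|
\;\le\; \sqrt{\frac{2\sigma^{2}\, I(W; S)}{n}} .
% The abstract's claim is that, in stochastic convex optimization, any
% learner minimizing the true risk must incur dimension-dependent I(W; S),
% so this bound cannot explain dimension-independent sample complexity.
```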


Adversarially Robust Learning: A Generic Minimax Optimal Learner and Characterization

Neural Information Processing Systems

We present a minimax optimal learner for the problem of learning predictors robust to adversarial examples at test-time. Interestingly, we find that this requires new algorithmic ideas and approaches to adversarially robust learning. In particular, we show, in a strong negative sense, the suboptimality of the robust learner proposed by [21] and a broader family of learners we identify as local learners. Our results are enabled by adopting a global perspective, specifically, through a key technical contribution: the global one-inclusion graph, which may be of independent interest, that generalizes the classical one-inclusion graph due to [17]. Finally, as a byproduct, we identify a dimension characterizing qualitatively and quantitatively what classes of predictors H are robustly learnable. This resolves an open problem due to [21], and closes a (potentially) infinite gap between the established upper and lower bounds on the sample complexity of adversarially robust learning.
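As a concrete aid, here is a small Python sketch (my own illustration, not the authors' construction) of the classical one-inclusion graph of [17] for a finite class projected onto a sample: vertices are the distinct behaviors of H on the points, and edges join behaviors that differ on exactly one point. The global one-inclusion graph of this paper generalizes this object; all function and variable names below are mine.

```python
from itertools import combinations

def one_inclusion_graph(hypotheses, points):
    """Build the classical one-inclusion graph of a finite class.

    hypotheses: iterable of callables h(x) -> 0/1
    points: the n+1 sample points
    Returns (vertices, edges): vertices are distinct behavior tuples,
    edges join behaviors that differ in exactly one coordinate.
    """
    vertices = {tuple(h(x) for x in points) for h in hypotheses}
    edges = [
        (u, v)
        for u, v in combinations(sorted(vertices), 2)
        if sum(a != b for a, b in zip(u, v)) == 1
    ]
    return vertices, edges

# Example: threshold classifiers on five points. A prediction strategy
# orients each edge; the classical result is that an orientation with
# maximum out-degree <= d yields expected error <= d / (n + 1).
points = [1, 2, 3, 4, 5]
thresholds = [lambda x, t=t: int(x >= t) for t in range(7)]
V, E = one_inclusion_graph(thresholds, points)
print(len(V), "behaviors,", len(E), "edges")  # thresholds form a path
```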


Table of Contents of Appendix: A Detailed Related Work; B Limitations and Potential Negative Societal Impacts; C Discussions and Details about Experiments in Figure 1; C.1 Summary

Neural Information Processing Systems

Distance-based methods rely on the assumption that OOD data should lie relatively far from the centroids of the ID classes [9], using measures such as Mahalanobis distance [9, 45], cosine similarity [74], and kernel similarity [75]. Early works used the maximum softmax score to express ID-ness [7]. Temperature scaling was then introduced to amplify the separation between ID and OOD data [14]. More recently, researchers have proposed hyperparameter-free energy scores to improve OOD uncertainty estimation [23, 71], and have also used the information contained in gradients to improve OOD detection performance [18].
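To make the logit-based scores in this paragraph concrete, here is a hedged numpy sketch of the three mentioned: maximum softmax probability [7], its temperature-scaled variant [14], and the energy score [23]. Exact conventions (sign, default temperature) vary across papers; this follows the common forms, and the function names are mine.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: higher = more ID-like [7]."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize exp
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

def temp_scaled_msp(logits, T=1000.0):
    """Temperature-scaled softmax score, as in ODIN-style methods [14]."""
    return msp_score(logits / T)

def energy_score(logits, T=1.0):
    """Negative free energy T*logsumexp(logits/T): higher = more ID-like [23]."""
    z = logits / T
    m = z.max(axis=-1)
    return T * (m + np.log(np.exp(z - m[..., None]).sum(axis=-1)))

# Usage: threshold the score to declare a test input OOD.
logits = np.array([[4.0, 0.5, 0.1], [1.1, 1.0, 0.9]])
print(msp_score(logits), energy_score(logits))
```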


Is Out-of-Distribution Detection Learnable? Zhen Fang

Neural Information Processing Systems

Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To relax this assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper we investigate the probably approximately correct (PAC) learning theory of OOD detection, which has been posed by researchers as an open problem. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we also offer theoretical support for several representative OOD detection works based on our OOD theory.


On the Power of Differentiable Learning versus PAC and SQ Learning Emmanuel Abbe

Neural Information Processing Systems

We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision ρ of the gradient calculations relative to the mini-batch size b (for SGD) and sample size m (for GD). With fine enough precision relative to mini-batch size, namely when bρ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for b = 1. Similarly, with fine enough precision relative to the sample size m, GD can also simulate any sample-based learning algorithm based on m samples. In particular, with polynomially many bits of precision (i.e. when ρ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size.
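The precision parameter ρ can be made concrete with a small sketch: mini-batch SGD where each gradient coordinate is rounded to the nearest multiple of ρ before the update, so that the dichotomy is governed by the product bρ. Everything below (names, the toy least-squares loss) is my illustration of the setup, not the authors' construction.

```python
import numpy as np

def quantize(g, rho):
    """Round each gradient coordinate to the nearest multiple of rho."""
    return rho * np.round(g / rho)

def sgd_finite_precision(grad_fn, w0, data, b=8, rho=1e-4, lr=0.1, steps=100):
    """Mini-batch SGD with rho-precision gradients.

    grad_fn(w, batch) -> gradient of the loss on the batch.
    When b * rho is small, individual examples can still influence the
    rounded gradient, which is what lets SGD go beyond SQ learning.
    """
    rng = np.random.default_rng(0)
    w = w0.copy()
    for _ in range(steps):
        batch = data[rng.choice(len(data), size=b, replace=False)]
        w -= lr * quantize(grad_fn(w, batch), rho)
    return w

# Toy usage: least squares on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0])
data = np.hstack([X, y[:, None]])
grad = lambda w, B: 2 * B[:, :4].T @ (B[:, :4] @ w - B[:, 4]) / len(B)
print(sgd_finite_precision(grad, np.zeros(4), data))
```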



Supplementary Material: A Proof of Theorem 3.1 (Realizable Case, Positive Result)

Neural Information Processing Systems

Let H be a hypothesis class with VC dimension d and let γ ∈ (0, 1). Then there exists a learner Lrn whose γ-tolerant adversarial risk meets the guarantee of Theorem 3.1. To prove Theorem 3.1, we will use the SPV meta-learner. First, we prove a more general result on the performance of our SPV meta-learner. We denote the algorithm obtained by executing SPV with a learner Lrn as the input algorithm by SPV(Lrn). Let H be a concept class, D be a distribution over examples, and Lrn be a learning rule. Let also β ∈ (0, 1) be the stability parameter given to SPV and let n ≥ 1/β be the sample size. By applying linearity of expectation we obtain the desired bound on the expected error of SPV(Lrn); see the sketch below. To prove Theorem 3.1, we will need an optimal learner as the input learner for SPV.
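The linearity-of-expectation step referenced above has, in generic notation, roughly the following shape; the extracted display is lost, so this is a reconstruction of the standard step rather than the authors' exact equation:

```latex
% Generic form of the linearity-of-expectation step: the expected
% empirical error of a hypothesis h over a sample S = (z_1, ..., z_n)
% drawn i.i.d. from D equals its population error.
\mathbb{E}_{S \sim \mathcal{D}^n}\!\left[ \frac{1}{n} \sum_{i=1}^{n}
  \mathbf{1}\{ h(x_i) \ne y_i \} \right]
= \frac{1}{n} \sum_{i=1}^{n} \Pr_{(x,y) \sim \mathcal{D}}\!\big[ h(x) \ne y \big]
= \mathrm{err}_{\mathcal{D}}(h).
```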


Simplifying Adversarially Robust PAC Learning with Tolerance

arXiv.org Artificial Intelligence

Adversarially robust PAC learning has proved to be challenging, with the currently best known learners [Montasser et al., 2021a] relying on improper methods based on intricate compression schemes, resulting in sample complexity exponential in the VC-dimension. A series of follow-up works considered a slightly relaxed version of the problem called adversarially robust learning with tolerance [Ashtiani et al., 2023, Bhattacharjee et al., 2023, Raman et al., 2024] and achieved better sample complexity in terms of the VC-dimension. However, those algorithms were either improper and complex, or required additional assumptions on the hypothesis class H. We prove, for the first time, the existence of a simpler learner that achieves a sample complexity linear in the VC-dimension without requiring additional assumptions on H. Even though our learner is improper, it is "almost proper" in the sense that it outputs a hypothesis that is "similar" to a hypothesis in H. We also use the ideas from our algorithm to construct a semi-supervised learner in the tolerant setting. This simple algorithm achieves bounds comparable to the previous (non-tolerant) semi-supervised algorithm of Attias et al. [2022a], but avoids the use of intricate subroutines from previous works, and is "almost proper."
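For completeness, the tolerant objective discussed here can be stated roughly as follows, in the form common to the tolerance line of work [Ashtiani et al., 2023]; the ball-perturbation sets and the symbols r, γ, ε below are my generic notation, not necessarily this paper's:

```latex
% Robust risk with perturbation radius r around each input:
\mathrm{R}_{r}(h; \mathcal{D}) =
  \mathbb{E}_{(x,y) \sim \mathcal{D}}
  \Big[ \sup_{x' : \|x' - x\| \le r} \mathbf{1}\{ h(x') \ne y \} \Big]
% Tolerant guarantee: the learner's radius-r robust risk is compared
% against the best achievable risk at the inflated radius r(1+\gamma):
\mathrm{R}_{r}(\hat{h}; \mathcal{D}) \le
  \min_{h \in H} \mathrm{R}_{r(1+\gamma)}(h; \mathcal{D}) + \epsilon .
```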