Statistical Learning
Adversarial Robustness with Semi-Infinite Constrained Learning
Despite strong performance in numerous applications, the fragility of deep learning to input perturbations has raised serious questions about its use in safety-critical domains. While adversarial training can mitigate this issue in practice, state-ofthe-art methods are increasingly application-dependent, heuristic in nature, and suffer from fundamental trade-offs between nominal performance and robustness. Moreover, the problem of finding worst-case perturbations is non-convex and underparameterized, both of which engender a non-favorable optimization landscape. Thus, there is a gap between the theory and practice of adversarial training, particularly with respect to when and why adversarial training works. In this paper, we take a constrained learning approach to address these questions and to provide a theoretical foundation for robust learning. In particular, we leverage semi-infinite optimization and non-convex duality theory to show that adversarial training is equivalent to a statistical problem over perturbation distributions, which we characterize completely. Notably, we show that a myriad of previous robust training techniques can be recovered for particular, sub-optimal choices of these distributions. Using these insights, we then propose a hybrid Langevin Monte Carlo approach of which several common algorithms (e.g., PGD) are special cases. Finally, we show that our approach can mitigate the trade-off between nominal and robust performance, yielding state-of-the-art results on MNIST and CIFAR-10.
ATheory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning (Supplementary File)
This supplementary document contains more additional experimental details and the technical proofs of convergence results of the NeurIPS'21 submission entitled "ATheory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning". It is structured as follows. In Appendix A, we provides more experimental details, including training algorithm, network architecture, optimizer details, loss construction and training cost of SANE. Appendix B presents the proof and details of the main results, namely, Theorem 1, in Section 2, which analyzes the generalization performance of MoCo. Next, Appendix C introduces the proof roadmap and details of the main results, i.e.
Appendix
We first introduce some handy concepts and results to make the proof succinct, meanwhile providing more information for understanding our model and theory. We begin with some extended discussions on CSG. Note that a reparameterization unnecessarily has its output dimensions in S, i.e. The condition that p(y|s) = p0(y|ฮฆS(s,v)) for any v V does not indicate that ฮฆS(s,v) is constant of v, since p0(y|s0) may ignore the change of s0 = ฮฆS(s,v) from the change of v. The following lemma shows the meaning of a reparameterization: it allows a CSG to vary while inducing the same distribution on the observed data variables (x,y) (i.e., holding the same effect on describing data). We can now define and verify an equivalent relation on CSGs so that the resulting equivalent class contains CSGs that induce the same (x,y) data distribution and hold the same semantic information in their svariables. We say two CSGs pand p0 are semantic-equivalent, if there exists a homeomorphism11 ฮฆ on S V, such that (i) is semantic-preserving: its output dimensions in S is constant of v, ฮฆS(s,v) = ฮฆS(s) for any v V, and (ii) it acts as a reparameterization from p to p0: ฮฆ#[ps,v] = p0s,v, p(x|s,v) = p0(x|ฮฆ(s,v)) and p(y|s) = p0(y|ฮฆS(s)). A.1 below shows that the defined binary relation is indeed an equivalence relation in common cases. As a reparameterization, ฮฆ allows the two models to have different latent-variable parameterizations while inducing the same distribution on the observed data variables (x,y) (Lemma 9). This definition of semantic-equivalence can be rephrased as the existence of a semantic-preserving reparameterization. With proper model assumptions, we can show that any reparameterization between two CSGs is semantic-preserving, so that semantic-preserving CSGs cannot be converted to each other by a reparameterization that mixes swith v. Lemma 11. For two CSGs pand p0, if p0(y|s) has a statistics M0(s) that is an injective function of s, then any reparameterization ฮฆ from pto p0, if exists, has its ฮฆS constant of v. Proof. Then the condition that p(y|s) = p0(y|ฮฆS(s,v)) for any v V indicates that M(s) = M0(ฮฆS(s,v)). If there exist s S and v(1) 6= v(2) V such that ฮฆS(s,v(1)) 6= ฮฆS(s,v(2)), then M0(ฮฆS(s,v(1))) 6= M0(ฮฆS(s,v(2))) 11A transformation is a homeomorphism if it is a continuous bijection with continuous inverse. This violates M(s) = M0(ฮฆS(s,v)) which requires both M0(ฮฆS(s,v(1))) and M0(ฮฆS(s,v(2))) to be equal to M(s). We then introduce two mathematical facts. Let z be a random variable on a Euclidean space RdZ with density function pz(z), and let ฮฆ be a homeomorphism on RdZ whose inverse ฮฆ 1 is differentiable.