

e45caa3d5273d105b8d045e748636957-Supplemental-Conference.pdf

Neural Information Processing Systems

In Figure 7 of this Appendix, we show that this is indeed due to a decrease in the robustness slope. Across three different datasets (MNIST, CIFAR10, NewsGroup20), we see that increasing the number of tasks leads to a decrease in the robustness slope.

Experiments on other languages. For our experiments on multilingual generative models, we decided to use Greek and English because we were looking for a linguistic pair with different morphology, syntax, and phonology. This ensures that any benefits in terms of robustness are not coming from exposure to more data. As shown in Figure 8, even though the two models start from roughly the same perplexity, the bilingual model exhibits higher structural robustness in the presence of weight deletions.
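The excerpt does not define how the robustness slope is measured; the sketch below shows one plausible protocol, assuming robustness is probed by zeroing a growing fraction of weights and fitting a line to loss versus deletion fraction. The parameter-dict model representation and the names `delete_weights` and `evaluate_loss` are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: estimate a "robustness slope" by zeroing random
# weights and fitting a line to loss vs. fraction of weights deleted.
import numpy as np

def delete_weights(params, fraction, rng):
    """Return a copy of `params` with `fraction` of entries zeroed at random."""
    out = {}
    for name, w in params.items():
        w = w.copy()
        w[rng.random(w.shape) < fraction] = 0.0
        out[name] = w
    return out

def robustness_slope(params, evaluate_loss, fractions=(0.0, 0.1, 0.2, 0.3), seed=0):
    """Slope of a least-squares line fit to loss as a function of deletion fraction."""
    rng = np.random.default_rng(seed)
    losses = [evaluate_loss(delete_weights(params, f, rng)) for f in fractions]
    slope, _ = np.polyfit(fractions, losses, deg=1)
    return slope  # flatter slope = slower degradation = more robust
```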


d9d347f57ae11f34235b4555710547d8-Supplemental.pdf

Neural Information Processing Systems

Let $X, Y, Z$ be random variables. Let $g: \mathcal{X} \to \mathbb{R}$ be a measurable function, and let $\mathbb{E}_{x \sim Q}[\exp g(x)] < \infty$. Then

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sup_{g} \Big\{ \mathbb{E}_{x \sim P}[g(x)] - \log \mathbb{E}_{x \sim Q}[\exp g(x)] \Big\}.$$

Their work has built a connection between PAC-Bayes meta-learning and hierarchical variational Bayes. In Appendix A.3 of [1], they give the generative graphical model for meta-learning where $U \to W \to S$ (their notation uses $\psi$ instead of $U$). The proof technique is analogous to Theorem 5.1. Let $\Phi = (U, W_{1:n})$ be a collection of random variables such that $\Phi$ and $S_{1:n}$ follow the joint distribution $P_{\Phi, S_{1:n}}$. Based on Theorem 5.2, for the Meta-SGLD that satisfies Assumption 1, if we set ... In fact, the algorithm has a nested-loop structure; we just list the above simple sub-structures for the first step of the proof.
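The supremum above is the standard Donsker-Varadhan variational representation of the KL divergence, attained at $g = \log(dP/dQ)$. As a sanity check, the sketch below compares a Monte Carlo estimate of the objective at this optimal $g$ against the closed-form KL between two Gaussians; all names and parameter values are illustrative.

```python
# Numerically check D_KL(P||Q) = sup_g { E_P[g] - log E_Q[exp g] } for
# P = N(mu, s^2), Q = N(0, 1), using the optimal witness g = log dP/dQ.
import numpy as np
from scipy.stats import norm

mu, s = 1.0, 0.7
P, Q = norm(mu, s), norm(0.0, 1.0)
rng = np.random.default_rng(0)
n = 200_000

g = lambda x: P.logpdf(x) - Q.logpdf(x)          # optimal witness function
dv = g(P.rvs(n, random_state=rng)).mean() \
     - np.log(np.exp(g(Q.rvs(n, random_state=rng))).mean())

kl_closed_form = np.log(1.0 / s) + (s**2 + mu**2) / 2.0 - 0.5
print(f"DV estimate: {dv:.4f}, closed form: {kl_closed_form:.4f}")
```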


Appendix Outline

Neural Information Processing Systems

Hence, we rely on the subgradients defined in Equation 7. Since many subgradient directions exist at the margin points, for consistency we stick with $\partial \ell_\gamma(w;(x,y)) = \{0\}$ when $y\langle w, x \rangle = \gamma$. Note that the set of points in $\mathcal{X}$ satisfying this equality is a zero-measure set. For simplicity, we shall treat the projection operation as just renormalizing $w^{(t+1)}$ to have unit norm, i.e., $\|w^{(t+1)}\|_2 = 1$ for all $t \geq 0$. This is not necessarily restrictive.

A.1 Technical Lemmas

In this section we shall state some technical lemmas without proof, with references to works that contain the full proofs. We shall use these in the following sections when proving our lemmas in Section 5.
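A minimal sketch of the update this passage describes, assuming a hinge-type margin loss $\ell_\gamma(w;(x,y)) = \max(0, \gamma - y\langle w, x\rangle)$ (the exact loss in Equation 7 is not shown in this excerpt): take a subgradient with the $\{0\}$ convention exactly at the margin, step, then renormalize to the unit sphere.

```python
# Sketch of a projected subgradient step for an assumed margin loss
# max(0, gamma - y<w,x>), with the {0} convention at the margin and
# "projection" implemented as renormalization to unit norm.
import numpy as np

def subgradient(w, x, y, gamma):
    m = y * np.dot(w, x)
    if m < gamma:
        return -y * x          # active region: unique (sub)gradient
    return np.zeros_like(x)    # m >= gamma, incl. the zero-measure case m == gamma

def projected_subgradient_step(w, x, y, gamma, eta):
    w = w - eta * subgradient(w, x, y, gamma)
    return w / np.linalg.norm(w)   # renormalize so ||w||_2 = 1
```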


Supplementary Material for: Adversarial Regression with Doubly Non-negative Weighting Matrices

Neural Information Processing Systems

A.1 Proofs of Section 3

In the following, the symbol $\langle \cdot, \cdot \rangle$ will be used to represent both the Frobenius inner product of matrices and the standard Euclidean inner product of vectors. For the second part, let $v$ be an eigenvector of $A$ corresponding to the eigenvalue $\lambda_{\max}(A)$. In case the maximum eigenvalue of $T$ is nonpositive, then from Lemma A.1 we see that the objective value of problem (A.2) evaluated ... For a $p \times p$ real matrix $A$, its spectral radius $R(A)$ is defined as the largest absolute value of its eigenvalues. Then the matrix $I - A$ is invertible and all entries of $(I - A)^{-1}$ are nonnegative. Also, the spectral radius of $(\gamma^\star)^{-1} \hat{\Omega}^{1/2} V(\beta) \hat{\Omega}^{1/2}$ is smaller than $1$ by the feasibility of $\gamma^\star$ in problem (A.5c).
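The invertibility and nonnegativity claim is the standard Neumann-series fact: if $A$ is entrywise nonnegative with spectral radius below $1$, then $(I - A)^{-1} = \sum_{k \geq 0} A^k$ is entrywise nonnegative. The sketch below checks this numerically on an arbitrary nonnegative test matrix; the matrix and truncation length are illustrative.

```python
# Check: if A is entrywise nonnegative with spectral radius R(A) < 1, then
# I - A is invertible and (I - A)^{-1} = sum_k A^k is entrywise nonnegative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))                             # nonnegative entries
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))    # rescale so R(A) = 0.9 < 1

inv = np.linalg.inv(np.eye(4) - A)
neumann = sum(np.linalg.matrix_power(A, k) for k in range(200))

assert np.all(inv >= 0)                       # nonnegative entries
assert np.allclose(inv, neumann, atol=1e-6)   # matches the Neumann series
```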


Escaping Saddle-Point Faster under Interpolation-like Conditions

Neural Information Processing Systems

One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $O(1/\epsilon^{2})$.
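For context, a minimal sketch of the generic PSGD template the abstract refers to: run SGD, and when the gradient looks small (a possible saddle), inject a random perturbation from a small ball to escape. The step size, threshold, perturbation radius, and gradient oracle are all illustrative assumptions, not the paper's exact parameters.

```python
# Minimal sketch of Perturbed Stochastic Gradient Descent (PSGD).
import numpy as np

def psgd(stoch_grad, w0, eta=0.01, g_thresh=1e-3, radius=1e-2,
         steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = stoch_grad(w)
        if np.linalg.norm(g) <= g_thresh:
            # Possible saddle point: perturb uniformly from a small ball.
            u = rng.normal(size=w.shape)
            u *= rng.random() ** (1 / w.size) * radius / np.linalg.norm(u)
            w = w + u
        else:
            w = w - eta * g
        # (Practical variants also limit how often perturbations fire.)
    return w
```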



5d9e4a04afb9f3608ccc76c1ffa7573e-Supplemental.pdf

Neural Information Processing Systems

Sets and scalars are represented by calligraphic and standard fonts, respectively. Intuitively, if $\Phi(w_0)$ is a $(\mu_\Phi, \nu_\Phi)$-near-isometry, then one would expect $\Phi$ to remain a near-isometry for all nearby points. We start with the basic definition of Hermite polynomials and their properties. A bound on $(2\|v\| + \|\delta v\|)$ is obtained in (A.41). Let $z \in \mathbb{R}^d$ denote a Gaussian random vector.
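Since the passage starts from the definition of Hermite polynomials, here is a small sketch assuming the probabilists' convention, $He_0 = 1$, $He_1(z) = z$, $He_{n+1}(z) = z\,He_n(z) - n\,He_{n-1}(z)$, which checks the Gaussian orthogonality relation $\mathbb{E}[He_m(z)\,He_n(z)] = n!\,\mathbf{1}\{m = n\}$ by Monte Carlo.

```python
# Probabilists' Hermite polynomials via the standard recurrence, plus a
# Monte Carlo check of Gaussian orthogonality: E[He_m He_n] = n! 1{m=n}.
import math
import numpy as np

def hermite(n, z):
    """He_n(z), probabilists' convention: He_0 = 1, He_1 = z."""
    h_prev, h = np.ones_like(z), z
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, z * h - k * h_prev   # He_{k+1} = z He_k - k He_{k-1}
    return h

z = np.random.default_rng(0).normal(size=1_000_000)
for m in range(4):
    for n in range(4):
        est = np.mean(hermite(m, z) * hermite(n, z))
        exact = math.factorial(n) if m == n else 0.0
        print(m, n, round(est, 2), exact)
```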


1 Context and Motivation

Neural Information Processing Systems

The coding rate can be accurately computed from finite samples of degenerate, subspace-like distributions, and it can be used to learn intrinsic representations in supervised, self-supervised, and unsupervised settings in a unified manner.
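The excerpt does not give the formula; in this line of work (e.g., the MCR² literature) the coding rate of a finite sample matrix $Z \in \mathbb{R}^{d \times n}$ at distortion $\epsilon$ is usually taken to be $R(Z, \epsilon) = \tfrac{1}{2}\log\det\!\big(I + \tfrac{d}{n\epsilon^2} Z Z^\top\big)$. The sketch below computes it and illustrates that samples near a low-dimensional subspace have a smaller coding rate than an isotropic cloud; the formula choice is an assumption from that literature.

```python
# Coding rate of a finite sample matrix Z in R^{d x n}, using the
# rate-distortion style definition assumed from the MCR^2 literature:
#   R(Z, eps) = 1/2 logdet(I + d/(n eps^2) Z Z^T).
import numpy as np

def coding_rate(Z, eps=0.5):
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

rng = np.random.default_rng(0)
iso = rng.normal(size=(10, 500))                             # full-rank cloud
low = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 500))   # ~2-dim subspace
print(coding_rate(iso), ">", coding_rate(low))
```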



176a579942089c4cdc70136c567932ab-Paper-Conference.pdf

Neural Information Processing Systems

We consider here the sparse Gaussian process regression (SGPR) approach introduced by Titsias [31], which is widely used in practice (see [1, 9] for implementations) and has been studied in many recent works [13, 21, 5, 6, 38, 28, 32, 22, 23].
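For concreteness, here is a small numpy sketch of the collapsed evidence lower bound from Titsias [31], $\log \mathcal{N}(y \mid 0,\, Q_{nn} + \sigma^2 I) - \mathrm{tr}(K_{nn} - Q_{nn})/(2\sigma^2)$ with $Q_{nn} = K_{nm} K_{mm}^{-1} K_{mn}$; the RBF kernel, jitter value, and inducing-point choice are illustrative assumptions, not tied to any implementation cited above.

```python
# Sketch of the collapsed SGPR lower bound of Titsias [31]:
#   ELBO = log N(y | 0, Q_nn + sigma^2 I) - tr(K_nn - Q_nn) / (2 sigma^2),
# with Q_nn = K_nm K_mm^{-1} K_mn. Kernel and jitter are illustrative.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def sgpr_elbo(X, y, Z, sigma2=0.1, jitter=1e-8):
    n = X.shape[0]
    Knn_diag = np.ones(n)                  # unit-variance RBF diagonal
    Kmm = rbf(Z, Z) + jitter * np.eye(Z.shape[0])
    Knm = rbf(X, Z)
    Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)
    S = Qnn + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(S)
    quad = y @ np.linalg.solve(S, y)
    log_marg = -0.5 * (logdet + quad + n * np.log(2 * np.pi))
    trace_term = (Knn_diag.sum() - np.trace(Qnn)) / (2 * sigma2)
    return log_marg - trace_term

# Toy usage: 100 points, 10 inducing inputs taken from the data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
print(sgpr_elbo(X, y, Z=X[::10]))
```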