





Neural Information Processing Systems

The Hessian of f(Z) can be viewed as a KN x KN matrix by vectorizing the matrix Z. For deeper linear networks, it can be shown that flat saddle points exist at the origin, but there are no spurious local minima [34, 37]. While most of these results based on the bottom-up approach explain optimization and generalization for certain types of deep neural networks, they provide limited insight into the practice of deep learning. In fact, our proof techniques are inspired by recent results on low-rank matrix recovery [77, 80]. Some of the metrics are similar to those presented in [1]. Figure 7 depicts the learning curves in terms of both training and test accuracy for all three optimization algorithms (i.e., SGD, Adam, and L-BFGS).
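To make the vectorization concrete, here is a small numerical sketch. The excerpt does not specify f, so a stand-in quadratic objective f(Z) = 0.5 * ||M Z||_F^2 (with a hypothetical matrix M) is used; the point is only that the Hessian in the vectorized coordinates is a KN x KN matrix.

```python
import numpy as np

# Hypothetical quadratic stand-in for the paper's f(Z); Z is K x N.
K, N = 3, 4
rng = np.random.default_rng(0)
M = rng.standard_normal((K, K))

def f(z_vec):
    Z = z_vec.reshape(K, N)          # un-vectorize (row-major)
    return 0.5 * np.sum((M @ Z) ** 2)

def hessian(fun, z, eps=1e-5):
    """Finite-difference Hessian in the vectorized coordinates."""
    d = z.size
    I = np.eye(d)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (fun(z + eps*I[i] + eps*I[j]) - fun(z + eps*I[i])
                       - fun(z + eps*I[j]) + fun(z)) / eps**2
    return H

z0 = rng.standard_normal(K * N)
H = hessian(f, z0)
print(H.shape)  # (12, 12): a KN x KN matrix, as in the text
# For this quadratic f and NumPy's row-major vec, the exact Hessian is
# the Kronecker product kron(M.T @ M, I_N).
```

Since f is quadratic here, the finite-difference formula recovers the Hessian up to floating-point roundoff, so the Kronecker structure can be checked numerically.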



Neural Information Processing Systems

The algorithm of [62] considers Algorithm 2 for the stochastic generalized linear bandit problem. Assume that θ is the true parameter of the reward model. Then we consider the lower bounds. For f_j(A) = <(1/2)(e_{j1} e_{j2}^T + e_{j2} e_{j1}^T), A> with j1 != j2, f_j(A_i) is 1 only when i = j and 0 otherwise. With Claim D.12 and Claim D.11 we get that g <= C sqrt(.). To get 1), we write V_l = [v_1, ..., v_l] in R^{d x l} and V_l^perp = [v_{l+1}, ..., v_k].
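The identity f_j(A_i) = 1 iff i = j can be checked numerically. The excerpt does not define A_i, so the sketch below assumes the natural choice A_i = e_{i1} e_{i2}^T + e_{i2} e_{i1}^T over off-diagonal index pairs.

```python
import numpy as np

d = 4

def E(a, b):
    """Symmetric 'coordinate' matrix 0.5 * (e_a e_b^T + e_b e_a^T)."""
    M = np.zeros((d, d))
    M[a, b] += 0.5
    M[b, a] += 0.5
    return M

# Off-diagonal index pairs (j1 < j2) playing the role of j in the text.
pairs = [(a, b) for a in range(d) for b in range(a + 1, d)]

def f(j, A):
    a, b = pairs[j]
    return np.sum(E(a, b) * A)   # Frobenius inner product <E_j, A>

# Hypothetical A_i = e_{i1} e_{i2}^T + e_{i2} e_{i1}^T; then f_j(A_i) = delta_{ij}.
for i, (a, b) in enumerate(pairs):
    A_i = np.zeros((d, d))
    A_i[a, b] = A_i[b, a] = 1.0
    vals = [f(j, A_i) for j in range(len(pairs))]
    assert vals[i] == 1.0 and sum(vals) == 1.0
print("f_j(A_i) = delta_{ij} verified for all pairs")
```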


Deep Networks Provably Classify Data on Curves (Supplemental)

Neural Information Processing Systems

We will also write ζ_θ(x) = f_θ(x) - f*(x) to denote the fitting error. We use Gaussian initialization: for ℓ in {1, 2, ..., L}, the weights are initialized as
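A minimal sketch of layerwise Gaussian initialization, to make the scheme concrete. The excerpt truncates before stating the variance, so the common fan-in scaling W_ij ~ N(0, 2/n_in) is used here as a placeholder, and the layer widths are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4
widths = [10, 64, 64, 64, 1]   # n_0, ..., n_L (hypothetical)

# For each layer l in {1, ..., L}, draw W^l with i.i.d. Gaussian entries.
# Variance 2/n_in is an assumed (He-style) choice, not from the excerpt.
weights = []
for l in range(1, L + 1):
    n_in, n_out = widths[l - 1], widths[l]
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
    weights.append(W)

print([W.shape for W in weights])
```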


Learning and Transferring Sparse Contextual Bigrams with Linear Transformers

Neural Information Processing Systems

We show that when trained from scratch, the training process can be split into an initial sample-intensive stage, where the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided a nontrivial correlation between the downstream and pretraining tasks, fine-tuning from a pretrained model allows us to bypass the initial sample-intensive stage.


Data-Oblivious and Data-Aware Poisoning Attacks

Neural Information Processing Systems

In this section, we show a separation between the power of data-oblivious and data-aware poisoning attacks on classification. A different goal could be to make θ fail on a particular test set of the adversary's interest, making it a targeted poisoning attack [3, 56], or to increase the probability of a general "bad predicate" of θ [44]. We now state and prove our separation between the power of data-oblivious and data-aware poisoning attacks on classification. In particular, we show that the empirical risk minimization (ERM) algorithm can be much more susceptible to data-aware poisoning adversaries than to data-oblivious ones. On the other hand, any adversary has a much smaller advantage in the data-oblivious game.