


Appendix

Neural Information Processing Systems

We have shown experimentally that our method is effective in a variety of domains; however, other problem domains may require additional hyperparameter tuning, which can be expensive.


Appendix

Neural Information Processing Systems

In particular, SQuARM-SGD [45] can be viewed as CHOCO-SGD with momentum, but its theoretical convergence rate is slower than the original CHOCO-SGD. We provide some examples of compression operators satisfying Definition 1 that are used in our experiments. Line 6), the penultimate line follows from W1 = 1, and the last line follows from the induction hypothesis at the t-th iteration. Line 3), in the second line we use the property of the mixing matrix 1ᵀW = 1ᵀ, and in the third line, we apply Young's inequality (cf. (9)). Bounding Ωt2 in (14b) Similar to the derivation of (14a), by applying the update rule of Gt in BEER (Line 8), the definition of compression operators (Definition 1), and Young's inequality, we have It then boils down to establishing (26).
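The mixing-matrix property invoked in this snippet can be checked numerically. The sketch below is illustrative only: the 3-node matrix `W`, the vector `x`, and the helper `mat_vec` are hypothetical and not taken from the paper. It shows that when both the rows and the columns of W sum to one, one mixing step x ← Wx leaves the network average unchanged.

```python
def mat_vec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


# Hypothetical symmetric, doubly stochastic mixing matrix for 3 nodes.
W = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
x = [1.0, 4.0, 7.0]  # one scalar value per node (assumed data)

row_sums = [sum(row) for row in W]        # W 1 = 1
col_sums = [sum(col) for col in zip(*W)]  # 1^T W = 1^T
mixed = mat_vec(W, x)                     # one mixing (gossip) step

mean_before = sum(x) / len(x)
mean_after = sum(mixed) / len(mixed)
print(row_sums, col_sums)
print(mixed, mean_before, mean_after)
```

The column-sum condition is what preserves the average; the row-sum condition keeps a consensus vector fixed under mixing.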


bc6d753857fe3dd4275dff707dedf329-Supplemental.pdf

Neural Information Processing Systems

In this setting, unlike the basic setting, the objective and constraints are not linear. We focus on a single state-action pair s, a, stage h, and objective m. Similarly, in constrained settings, its estimated resource consumptions are underestimates of the true resource consumptions. B.5 Bounding the Bellman error We now provide an upper bound on the Bellman error which arises in the RHS of the regret decomposition (Proposition 3.3). When neither failure event occurs (probability 1 − 2δ), Proposition 3.3 upper bounds either of reward or consumption regret by In this section, we prove the main guarantee for the convex-concave setting.


Estimator

Neural Information Processing Systems

Observations o = δx are sampled with uniform distribution on x ∼ U[ 1,3] (shown in blue). f̂λ is calculated 500 times for different realizations of the training data (10 example predictors are shown in dashed lines); its mean and 2 standard deviations are shown in red. The true function f(x) = x² + 2cos(4x) is shown in black. Preliminary: Big-P notation Throughout our proofs, we will frequently rely on a polynomial analogue of the big-O notation, which we call big-P: Definition 1. Let us observe that all the quantities we study (the predictor, the risk and empirical risk) stay the same if any observation oi is replaced by −oi. The existence and the uniqueness of the solution in the cone spanned by 1 and 1/z of the equation can be argued as follows.


Supplementary Material: Relaxing Local Robustness

Neural Information Processing Systems

This presents a problem for certifying unseen points as the ground truth cannot be known. We therefore stipulate that certification must be independent of the true label of the point being certified. Moreover, replacing the ground truth with the predicted label is unsatisfactory, because the purpose of generalizing to top-k predictions is to consider cases where any of the predictions in Fk(x) may be correct. We would thus like to predict only when m(S,x) < 0. To accomplish this we create an instrumented model, g, as given by Equation B2. First, by applying (C7), we obtain (C8).


and Learning

Neural Information Processing Systems

Broadly speaking, compression either involves quantization [33, 50, 27, 26, 28-31, 15, 32] to reduce the precision of transmitted information, or biased sparsification [24, 25, 35, 34, 51, 52, 49, 53] to transmit only a few components of a vector with the largest magnitudes. The DIANA technique was further generalized in [31] to account for a variety of compressors. For 0 < η ≤ 1/(L+β) < 1/(Li+β), i ∈ S, we have 0 < 1 − η(λi+β) < 1, and hence, D is a symmetric positive-definite matrix. In this section, we will compile some results that will prove to be useful later in our analysis. We do so to set up the basic proof structure that we will later build on for analyzing more involved settings.
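To make the sparsification idea in this snippet concrete, here is a minimal sketch of a top-k compressor, which keeps only the k largest-magnitude components of a vector and zeroes the rest. The function names `top_k` and `squared_norm` and the sample vector are illustrative assumptions, not code from the paper. The final check uses the standard contraction property of top-k, ||C(x) − x||² ≤ (1 − k/d)||x||².

```python
def top_k(vector, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    if not 0 <= k <= len(vector):
        raise ValueError("k must be between 0 and len(vector)")
    # Indices sorted by decreasing absolute value; ties broken by position.
    order = sorted(range(len(vector)), key=lambda i: (-abs(vector[i]), i))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def squared_norm(vector):
    """Squared Euclidean norm."""
    return sum(v * v for v in vector)


x = [0.5, -3.0, 0.1, 2.0, -0.2]  # hypothetical vector to transmit
k = 2
compressed = top_k(x, k)
error = squared_norm([c - v for c, v in zip(compressed, x)])
bound = (1 - k / len(x)) * squared_norm(x)

print(compressed)      # [0.0, -3.0, 0.0, 2.0, 0.0]
print(error <= bound)  # True
```

Only the k surviving values and their indices need to be sent, which is where the communication saving comes from. The compressor is biased because the dropped components are not compensated in expectation.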


Supplementary Material: Robust Optimal Transport with Applications in Generative Modeling and Domain Adaptation 1 Proofs

Neural Information Processing Systems

The constraint P_X, P_Y ∈ Prob(X) states that P_X and P_Y are valid probability distributions. For brevity, we shall ignore explicitly stating it in the rest of the proof. The above equation is similar in spirit to the Kantorovich-Rubinstein duality. An important observation to note is that the above optimization only maximizes over a single discriminator function (as opposed to two functions in optimization (2)). Hence, it is easier to train it in large-scale deep learning problems such as GANs.


1n logE h

Neural Information Processing Systems

Lemma 2 (Chernoff bound for irreducible Markov chains). The proof is based on the argument given in Appendix A.2 of [7], adapted though for the case of Markov chains. We start the analysis by establishing the relation between the expected regret, Equation 1, and its proxy, Equation 17. For the first part, we show in Appendix C that the expected number of times that an arm a ∈ {1,...,N} hasn't been played is of the order of O(log log T). Assume that the one-parameter family of Markov chains on the finite state space S, together with the reward function f: S → R, satisfy conditions (18), (19), (20), (21), and (22).

