we obtain
Appendix
In particular, SQuARM-SGD [45] can be viewed as CHOCO-SGD with momentum, but its theoretical convergence rate is slower than that of the original CHOCO-SGD. We provide some examples of compression operators satisfying Definition 1 that are used in our experiments. Line 6), the penultimate line follows from W1 = 1, and the last line follows from the induction hypothesis at the t-th iteration. Line 3), in the second line we use the property of the mixing matrix 1^T W = 1^T, and in the third line, we apply Young's inequality (cf. (9)). Bounding Ω_2^t in (14b). Similar to the derivation of (14a), by applying the update rule of G^t in BEER (Line 8), the definition of compression operators (Definition 1), and Young's inequality, we have It then boils down to establishing (26).
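As an illustration of the kind of compression operator mentioned above, here is a minimal sketch of top-k sparsification. It assumes that Definition 1 is the usual contraction property E||C(x) - x||^2 <= (1 - alpha) ||x||^2, which top-k is known to satisfy with alpha = k/d. The names `top_k` and `sq_norm` and the sample vector are hypothetical and do not come from the source.

```python
def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    if not 0 < k <= len(x):
        raise ValueError("k must satisfy 0 < k <= len(x)")
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def sq_norm(x):
    """Squared Euclidean norm of a list of floats."""
    return sum(v * v for v in x)


# Hypothetical example vector; any real vector would do.
x = [3.0, -1.0, 0.5, -4.0, 2.0]
k = 2
cx = top_k(x, k)

# Assumed contraction parameter for top-k: alpha = k / d.
alpha = k / len(x)
err = sq_norm([a - b for a, b in zip(cx, x)])

# Check the assumed contraction property on this one vector.
satisfies_contraction = err <= (1 - alpha) * sq_norm(x)
```

Top-k is deterministic, so the expectation in the assumed definition is trivial here; a randomized operator such as random-k would need the inequality checked on average over its randomness.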
In this setting, unlike the basic setting, the objective and constraints are not linear. We focus on a single state-action pair s, a, stage h, and objective m. Similarly, in constrained settings, its estimated resource consumptions are underestimates of the true resource consumptions. B.5 Bounding the Bellman error. We now provide an upper bound on the Bellman error, which arises in the RHS of the regret decomposition (Proposition 3.3). When neither failure event occurs (probability 1 − 2δ), Proposition 3.3 upper bounds either the reward or the consumption regret by In this section, we prove the main guarantee for the convex-concave setting.
Estimator
Observations o = δ_x are sampled with uniform distribution on x ∼ U[−1, 3] (shown in blue). f̂_λ is calculated 500 times for different realizations of the training data (10 example predictors are shown in dashed lines); its mean and 2 standard deviations are shown in red. The true function f(x) = x^2 + 2cos(4x) is shown in black. Preliminary: Big-P notation. Throughout our proofs, we will frequently rely on a polynomial analogue of the big-O notation, which we call big-P: Definition 1. Let us observe that all the quantities we study (the predictor, the risk and the empirical risk) stay the same if any observation o_i is replaced by −o_i. The existence and the uniqueness of the solution in the cone spanned by 1 and 1/z of the equation can be argued as follows.
Supplementary Material: Relaxing Local Robustness
This presents a problem for certifying unseen points, as the ground truth cannot be known. We therefore stipulate that certification must be independent of the true label of the point being certified. Moreover, replacing the ground truth with the predicted label is unsatisfactory, because the purpose of generalizing to top-k predictions is to consider cases where any of the predictions in F_k(x) may be correct. We would thus like to predict only when m(S, x) < 0. To accomplish this we create an instrumented model, g, as given by Equation B2. First, by applying (C7), we obtain (C8).
and Learning
Broadly speaking, compression either involves quantization [33, 50, 27, 26, 28-31, 15, 32] to reduce the precision of transmitted information, or biased sparsification [24, 25, 35, 34, 51, 52, 49, 53] to transmit only a few components of a vector with the largest magnitudes. The DIANA technique was further generalized in [31] to account for a variety of compressors. For 0 < η ≤ 1/(L + β) < 1/(L_i + β), i ∈ S, we have 0 < 1 − η(λ_i + β) < 1, and hence, D is a symmetric positive-definite matrix. In this section, we will compile some results that will prove to be useful later in our analysis. We do so to set up the basic proof structure that we will later build on for analyzing more involved settings.
Supplementary Material: Robust Optimal Transport with Applications in Generative Modeling and Domain Adaptation 1 Proofs
The constraint P_X, P_Y ∈ Prob(X) states that P_X and P_Y are valid probability distributions. For brevity, we shall not state it explicitly in the rest of the proof. The above equation is similar in spirit to the Kantorovich-Rubinstein duality. An important observation is that the above optimization maximizes over only a single discriminator function (as opposed to two functions in optimization (2)). Hence, it is easier to train in large-scale deep learning problems such as GANs.
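For reference, the classical Kantorovich-Rubinstein duality mentioned above can be written as follows. This is the standard textbook statement for the Wasserstein-1 distance, not the robust formulation derived in this proof; the symbols P, Q, and f are notation assumed here for illustration.

```latex
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \; \mathbb{E}_{x \sim P}\big[f(x)\big] \;-\; \mathbb{E}_{y \sim Q}\big[f(y)\big]
```

The supremum ranges over a single 1-Lipschitz function f, which is the sense in which a dual with one discriminator resembles this form.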
1n logE h
Lemma 2 (Chernoff bound for irreducible Markov chains). The proof is based on the argument given in Appendix A.2 of [7], though adapted to the case of Markov chains. We start the analysis by establishing the relation between the expected regret, Equation 1, and its proxy, Equation 17. For the first part, we show in Appendix C that the expected number of times that an arm a ∈ {1, ..., N} hasn't been played is of the order of O(log log T). Assume that the one-parameter family of Markov chains on the finite state space S, together with the reward function f: S → R, satisfy conditions (18), (19), (20), (21), and (22).