Online Adaptive Methods, Universality and Acceleration

Kfir Y. Levy, Alp Yurtsever, Volkan Cevher

Neural Information Processing Systems

Conversely, adaptive first order methods are very popular in Machine Learning, with AdaGrad [12] being the most prominent method among this class. AdaGrad is an online learning algorithm which adapts its learning rate using the feedback (gradients) received through the optimization process, and is known to successfully handle noisy feedback.
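
As a rough illustration of the adaptation described in this abstract, the sketch below shows a minimal per-coordinate AdaGrad step in Python. It is an illustrative sketch, not the paper's exact algorithm; the function and variable names are our own. The key point is that each coordinate's effective learning rate shrinks with the squared gradients accumulated so far.

```python
import numpy as np

def adagrad_step(x, grad, accum, eta=0.1, eps=1e-8):
    """One AdaGrad update; accum holds the running sum of squared gradients."""
    accum = accum + grad ** 2
    x = x - eta * grad / (np.sqrt(accum) + eps)
    return x, accum

# Toy usage: minimize f(x) = ||x||^2 / 2, whose gradient at x is x itself.
x = np.array([3.0, -2.0])
accum = np.zeros_like(x)
for _ in range(500):
    x, accum = adagrad_step(x, grad=x, accum=accum)
print(x)  # approaches the minimizer (0, 0)
```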



Neural Information Processing Systems

We introduce a simple but general online learning framework in which a learner plays against an adversary in a vector-valued game that changes every round. Even though the learner's objective is not convex-concave (and so the minimax theorem does not apply), we give a simple algorithm that can compete with the setting in which the adversary must announce their action first, with optimally diminishing regret.
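
The toy loop below sketches the generic online-learning round structure this abstract builds on, using a standard multiplicative-weights (Hedge) learner against the usual best-fixed-action benchmark. This is only a sketch of the round protocol and regret bookkeeping: the paper's benchmark, in which the adversary must announce its action first, is stronger, and Hedge is not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, T = 3, 500
eta = np.sqrt(np.log(n_actions) / T)    # standard Hedge step size
weights = np.ones(n_actions)
cum_losses = np.zeros(n_actions)        # cumulative loss of each fixed action
learner_loss = 0.0

for t in range(T):
    probs = weights / weights.sum()
    losses = rng.random(n_actions)      # adversary's loss vector this round
    learner_loss += probs @ losses      # learner's expected loss this round
    cum_losses += losses
    weights *= np.exp(-eta * losses)    # multiplicative-weights update

regret = learner_loss - cum_losses.min()  # regret vs. best fixed action
print(regret)                              # O(sqrt(T log n_actions)) for Hedge
```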




Neural Information Processing Systems

However, the resulting methods often suffer from high computational complexity, which has reduced their practical applicability. For example, in the case of multiclass logistic regression, the aggregating forecaster (Foster et al. (2018)) achieves a regret of O(log(Bn)), whereas Online Newton Step achieves O(e^B log(n)), obtaining a double exponential gain in B (a bound on the norm of comparative functions).
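
To see the scale of the gap quoted above, one can evaluate the two rates numerically. The snippet below ignores constants, and B and n are arbitrary example values of ours, not figures from the paper.

```python
import math

# Illustrative magnitudes for the two regret rates (constants ignored;
# B and n are example values, not from the paper).
B, n = 10.0, 100_000
aggregating_forecaster = math.log(B * n)        # O(log(Bn))
online_newton_step = math.exp(B) * math.log(n)  # O(e^B log(n))
print(aggregating_forecaster)  # ~ 13.8
print(online_newton_step)      # ~ 2.5e5, exponentially larger in B
```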







Neural Information Processing Systems

We first prove the direction that efficiency ordering implies Loewner ordering. Next we want to show lim_{t→∞} (I − γA)^t = 0. Since we assume 0 < γ < 2/‖A‖_2, we have ‖I − γA‖_2 = max_{i=1,…,n} |1 − γλ_i(A)| < 1, where λ_i(A) > 0 is the i-th eigenvalue of the positive definite matrix A. For the original function G: R^d × V → R^d, we define another function Φ: R^d × E → R^d such that Φ(θ, e_ij) = G(θ, j). This is true for a periodic Markov chain, and is shown in the following lemma. Due to its random nature across each epoch, random shuffling is not a Markov chain on the state space [n].
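
The spectral-norm step above can be checked numerically. The following sketch is illustrative rather than part of the paper's proof: it builds a random symmetric positive definite A, picks a γ satisfying 0 < γ < 2/‖A‖_2, and verifies that ‖I − γA‖_2 < 1 and that (I − γA)^t decays toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)             # random symmetric positive definite matrix
gamma = 1.0 / np.linalg.norm(A, 2)  # satisfies 0 < gamma < 2 / ||A||_2
# ||I - gamma*A||_2 = max_i |1 - gamma * lambda_i(A)| < 1 for symmetric A:
print(np.abs(1.0 - gamma * np.linalg.eigvalsh(A)).max())
# hence (I - gamma*A)^t -> 0 as t grows:
P = np.linalg.matrix_power(np.eye(4) - gamma * A, 200)
print(np.linalg.norm(P, 2))         # numerically ~ 0
```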