near-optimal
Near-optimal learning with average Hölder smoothness
We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the effective smoothness of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic worst-case Hölder constant.We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness.Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate.From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates.Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry.Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.
Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity
Requests for name changes in the electronic proceedings will be accepted with no questions asked. However name changes may cause bibliographic tracking issues. Authors are asked to consider this carefully and discuss it with their co-authors prior to requesting a name change in the electronic proceedings. Use the Report an Issue link to request a name change.
Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity
This paper considers the distributed convex-concave minimax optimization under the second-order similarity.We propose stochastic variance-reduced optimistic gradient sliding (SVOGS) method, which takes the advantage of the finite-sum structure in the objective by involving the mini-batch client sampling and variance reduction.We prove SVOGS can achieve the \varepsilon -duality gap within communication rounds of {\mathcal O}(\delta D 2/\varepsilon), communication complexity of {\mathcal O}(n \sqrt{n}\delta D 2/\varepsilon),and local gradient calls of \tilde{\mathcal O}(n (\sqrt{n}\delta L)D 2/\varepsilon\log(1/\varepsilon)), where n is the number of nodes, \delta is the degree of the second-order similarity, L is the smoothness parameter and D is the diameter of the constraint set.We can verify that all of above complexity (nearly) matches the corresponding lower bounds.For the specific \mu -strongly-convex- \mu -strongly-convex case, our algorithm has the upper bounds on communication rounds, communication complexity, and local gradient calls of \mathcal O(\delta/\mu\log(1/\varepsilon)), {\mathcal O}((n \sqrt{n}\delta/\mu)\log(1/\varepsilon)), and \tilde{\mathcal O}(n (\sqrt{n}\delta L)/\mu)\log(1/\varepsilon)) respectively, which are also nearly tight.Furthermore, we conduct the numerical experiments to show the empirical advantages of proposed method.
Near-optimal learning with average Hölder smoothness
We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant.We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness.Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate.From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates.Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry.Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.
Procrastinating with Confidence: Near-Optimal, Anytime, Adaptive Algorithm Configuration
Kleinberg, Robert, Leyton-Brown, Kevin, Lucier, Brendan, Graham, Devon
Algorithm configuration methods optimize the performance of a parameterized heuristic algorithm on a given distribution of problem instances. Recent work introduced an algorithm configuration procedure ( Structured Procrastination'') that provably achieves near optimal performance with high probability and with nearly minimal runtime in the worst case. It also offers an anytime property: it keeps tightening its optimality guarantees the longer it is run. Unfortunately, Structured Procrastination is not adaptive to characteristics of the parameterized algorithm: it treats every input like the worst case. Follow-up work ( LeapsAndBounds'') achieves adaptivity but trades away the anytime property.