Better scalability under potentially heavy-tailed feedback
We study scalable alternatives to robust gradient descent (RGD) techniques that can be used when the losses and/or gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we instead focus computational effort on robustly choosing (or newly constructing) a strong candidate based on a collection of cheap stochastic sub-processes which can be run in parallel. The exact selection process depends on the convexity of the underlying objective, but in all cases, our selection technique amounts to a robust form of boosting the confidence of weak learners. In addition to formal guarantees, we also provide empirical analysis of robustness to perturbations to experimental conditions, under both sub-Gaussian and heavy-tailed data, along with applications to a variety of benchmark datasets. The overall take-away is an extensible procedure that is simple to implement, trivial to parallelize, which keeps the formal merits of RGD methods but scales much better to large learning problems.
Dec-14-2020
- Country:
- North America > Canada
- Europe > United Kingdom
- England
- Cambridgeshire > Cambridge (0.04)
- Oxfordshire > Oxford (0.04)
- England
- Asia
- Middle East > Israel
- Jerusalem District > Jerusalem (0.04)
- Japan > Honshū
- Kansai > Osaka Prefecture > Osaka (0.04)
- Middle East > Israel
- Genre:
- Research Report > New Finding (0.45)
- Industry:
- Education (0.65)
- Technology: