Goto

Collaborating Authors

 nulla 1


Multi-Timescale Gradient Sliding for Distributed Optimization

Zhang, Junhui, Jaillet, Patrick

arXiv.org Artificial Intelligence

We propose two first-order methods for convex, non-smooth, distributed optimization problems, hereafter called Multi-Timescale Gradient Sliding (MT-GS) and its accelerated variant (AMT-GS). Our MT-GS and AMT-GS can take advantage of similarities between (local) objectives to reduce the communication rounds, are flexible so that different subsets (of agents) can communicate at different, user-picked rates, and are fully deterministic. These three desirable features are achieved through a block-decomposable primal-dual formulation, and a multi-timescale variant of the sliding method introduced in Lan et al. (2020), Lan (2016), where different dual blocks are updated at potentially different rates. To find an $ε$-suboptimal solution, the complexities of our algorithms achieve optimal dependency on $ε$: MT-GS needs $O(\overline{r}A/ε)$ communication rounds and $O(\overline{r}/ε^2)$ subgradient steps for Lipchitz objectives, and AMT-GS needs $O(\overline{r}A/\sqrt{εμ})$ communication rounds and $O(\overline{r}/(εμ))$ subgradient steps if the objectives are also $μ$-strongly convex. Here, $\overline{r}$ measures the ``average rate of updates'' for dual blocks, and $A$ measures similarities between (subgradients of) local functions. In addition, the linear dependency of communication rounds on $A$ is optimal (Arjevani and Shamir 2015), thereby providing a positive answer to the open question whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).


Benign Overfitting and the Geometry of the Ridge Regression Solution in Binary Classification

Tsigler, Alexander, Chamon, Luiz F. O., Frei, Spencer, Bartlett, Peter L.

arXiv.org Machine Learning

In this work, we investigate the behavior of ridge regression in an overparameterized binary classification task. We assume examples are drawn from (anisotropic) class-conditional cluster distributions with opposing means and we allow for the training labels to have a constant level of label-flipping noise. We characterize the classification error achieved by ridge regression under the assumption that the covariance matrix of the cluster distribution has a high effective rank in the tail. We show that ridge regression has qualitatively different behavior depending on the scale of the cluster mean vector and its interaction with the covariance matrix of the cluster distributions. In regimes where the scale is very large, the conditions that allow for benign overfitting turn out to be the same as those for the regression task. We additionally provide insights into how the introduction of label noise affects the behavior of the minimum norm interpolator (MNI). The optimal classifier in this setting is a linear transformation of the cluster mean vector and in the noiseless setting the MNI approximately learns this transformation. On the other hand, the introduction of label noise can significantly change the geometry of the solution while preserving the same qualitative behavior.


Task Shift: From Classification to Regression in Overparameterized Linear Models

LaBonte, Tyler, Lai, Kuo-Wei, Muthukumar, Vidya

arXiv.org Machine Learning

Modern machine learning methods have recently demonstrated remarkable capability to generalize under task shift, where latent knowledge is transferred to a different, often more difficult, task under a similar data distribution. We investigate this phenomenon in an overparameterized linear regression setting where the task shifts from classification during training to regression during evaluation. In the zero-shot case, wherein no regression data is available, we prove that task shift is impossible in both sparse signal and random signal models for any Gaussian covariate distribution. In the few-shot case, wherein limited regression data is available, we propose a simple postprocessing algorithm which asymptotically recovers the ground-truth predictor. Our analysis leverages a fine-grained characterization of individual parameters arising from minimum-norm interpolation which may be of independent interest. Our results show that while minimum-norm interpolators for classification cannot transfer to regression a priori, they experience surprisingly structured attenuation which enables successful task shift with limited additional data.


Fast Adaptive Federated Bilevel Optimization

Huang, Feihu

arXiv.org Artificial Intelligence

Bilevel optimization is a popular hierarchical model in machine learning, and has been widely applied to many machine learning tasks such as meta learning, hyperparameter learning and policy optimization. Although many bilevel optimization algorithms recently have been developed, few adaptive algorithm focuses on the bilevel optimization under the distributed setting. It is well known that the adaptive gradient methods show superior performances on both distributed and non-distributed optimization. In the paper, thus, we propose a novel adaptive federated bilevel optimization algorithm (i.e.,AdaFBiO) to solve the distributed bilevel optimization problems, where the objective function of Upper-Level (UL) problem is possibly nonconvex, and that of Lower-Level (LL) problem is strongly convex. Specifically, our AdaFBiO algorithm builds on the momentum-based variance reduced technique and local-SGD to obtain the best known sample and communication complexities simultaneously. In particular, our AdaFBiO algorithm uses the unified adaptive matrices to flexibly incorporate various adaptive learning rates to update variables in both UL and LL problems. Moreover, we provide a convergence analysis framework for our AdaFBiO algorithm, and prove it needs the sample complexity of $\tilde{O}(\epsilon^{-3})$ with communication complexity of $\tilde{O}(\epsilon^{-2})$ to obtain an $\epsilon$-stationary point. Experimental results on federated hyper-representation learning and federated data hyper-cleaning tasks verify efficiency of our algorithm.