Multi-Timescale Gradient Sliding for Distributed Optimization

Zhang, Junhui, Jaillet, Patrick

arXiv.org Artificial Intelligence 

We propose two first-order methods for convex, non-smooth, distributed optimization problems, hereafter called Multi-Timescale Gradient Sliding (MT-GS) and its accelerated variant (AMT-GS). Our MT-GS and AMT-GS can take advantage of similarities between (local) objectives to reduce the communication rounds, are flexible so that different subsets (of agents) can communicate at different, user-picked rates, and are fully deterministic. These three desirable features are achieved through a block-decomposable primal-dual formulation, and a multi-timescale variant of the sliding method introduced in Lan et al. (2020), Lan (2016), where different dual blocks are updated at potentially different rates. To find an $ε$-suboptimal solution, the complexities of our algorithms achieve optimal dependency on $ε$: MT-GS needs $O(\overline{r}A/ε)$ communication rounds and $O(\overline{r}/ε^2)$ subgradient steps for Lipchitz objectives, and AMT-GS needs $O(\overline{r}A/\sqrt{εμ})$ communication rounds and $O(\overline{r}/(εμ))$ subgradient steps if the objectives are also $μ$-strongly convex. Here, $\overline{r}$ measures the ``average rate of updates'' for dual blocks, and $A$ measures similarities between (subgradients of) local functions. In addition, the linear dependency of communication rounds on $A$ is optimal (Arjevani and Shamir 2015), thereby providing a positive answer to the open question whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).