Convergence of Alternating Gradient Descent for Matrix Factorization
Ward, Rachel, Kolda, Tamara G.
–arXiv.org Artificial Intelligence
We consider alternating gradient descent (AGD) with fixed step size $\eta > 0$, applied to the asymmetric matrix factorization objective. We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = \left( \left(\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})}\right)^2 \log(1/\epsilon)\right)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| \mathbf{A} - \mathbf{X}_T^{\vphantom{\intercal}} \mathbf{Y}_T^{\intercal} \|_{\rm F}^2 \leq \epsilon \| \mathbf{A} \|_{\rm F}^2$ with high probability starting from an atypical random initialization. The factors have rank $d>r$ so that $\mathbf{X}_T\in\mathbb{R}^{m \times d}$ and $\mathbf{Y}_T \in\mathbb{R}^{n \times d}$. Experiments suggest that our proposed initialization is not merely of theoretical benefit, but rather significantly improves convergence of gradient descent in practice. Our proof is conceptually simple: a uniform PL-inequality and uniform Lipschitz smoothness constant are guaranteed for a sufficient number of iterations, starting from our random initialization. Our proof method should be useful for extending and simplifying convergence analyses for a broader class of nonconvex low-rank factorization problems.
arXiv.org Artificial Intelligence
May-11-2023
- Country:
- North America > United States
- Texas > Travis County
- Austin (0.04)
- California > Alameda County
- Dublin (0.04)
- Texas > Travis County
- Europe > United Kingdom
- England > Cambridgeshire > Cambridge (0.04)
- Africa > Senegal
- Kolda Region > Kolda (0.05)
- North America > United States
- Genre:
- Research Report (0.84)
- Technology: