AITopics | prox-svrg

Neural Information Processing Systems http://nips.cc/

algorithm, available, sarah, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

a376033f78e144f494bfc743c0be3330-Supplemental.pdf

Neural Information Processing SystemsFeb-10-2026, 10:14:37 GMT

Inthis section, we provide theoretical analysis ofHSPG. Moreover, we further point out that: (1) theSub-gradient Descent Stepwe used to achieve a "close enough" solution canbereplaced byothermethods, and(2)theAssumption 4isonlyasufficientcondition thatwecouldusetoshowthe"closeenough"condition. B.1 RelatedWork Problem (12)has been well studied indeterministic optimization with various algorithms that are capable ofreturning solutions with both lowobjectivevalueandhigh group sparsity under proper λ(95;73;42;64). For example, proximal stochastic variance-reduced gradient method (Prox-SVRG)(88)and proximal spider (Prox-Spider) (97) are developed to adopt multi-stage schemes based on the well-known variance reduction technique SVRG proposed in (46) and Spider developed in (22) respectively. Under Assumption 1, the search directiondk is a descent direction forψBk(xk), i.e., d>k ψBk(xk)<0.

artificial intelligence, gk 2, machine learning, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Breaking the Span Assumption Yields Fast Finite-Sum Minimization

Robert Hannah, Yanli Liu, Daniel O'Connor, Wotao Yin

Neural Information Processing SystemsNov-20-2025, 19:12:47 GMT

Most finite sum algorithms follow what we call the "span

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Stochastic Proximal Gradient Descent with Acceleration Techniques

Neural Information Processing SystemsSep-30-2025, 08:48:34 GMT

Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. This method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction for the stochastic gradient. Accelerated proximal gradient descent (APG) and proximal stochastic variance reduction gradient (Prox-SVRG) are in a trade-off relationship. We show that our method, with the appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.

acceleration technique, name change, stochastic proximal gradient descent, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

a376033f78e144f494bfc743c0be3330-Supplemental.pdf

Neural Information Processing SystemsAug-16-2025, 13:32:51 GMT

artificial intelligence, half-space step, machine learning, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Orthant Based Proximal Stochastic Gradient Method for $\ell_1$-Regularized Optimization

Chen, Tianyi, Ding, Tianyu, Ji, Bo, Wang, Guanyi, Shi, Yixin, Yi, Sheng, Tu, Xiao, Zhu, Zhihui

arXiv.org Machine LearningApr-7-2020

Sparsity-inducing regularization problems are ubiquitous in machine learning applications, ranging from feature selection to model compression. In this paper, we present a novel stochastic method -- Orthant Based Proximal Stochastic Gradient Method (OBProx-SG) -- to solve perhaps the most popular instance, i.e., the l1-regularized problem. The OBProx-SG method contains two steps: (i) a proximal stochastic gradient step to predict a support cover of the solution; and (ii) an orthant step to aggressively enhance the sparsity level via orthant face projection. Compared to the state-of-the-art methods, e.g., Prox-SG, RDA and Prox-SVRG, the OBProx-SG not only converges to the global optimal solutions (in convex scenario) or the stationary points (in non-convex scenario), but also promotes the sparsity of the solutions substantially. Particularly, on a large number of convex problems, OBProx-SG outperforms the existing methods comprehensively in the aspect of sparsity exploration and objective values. Moreover, the experiments on non-convex deep neural networks, e.g., MobileNetV1 and ResNet18, further demonstrate its superiority by achieving the solutions of much higher sparsity without sacrificing generalization accuracy.

obprox-sg, orthant step, prox-sg step, (11 more...)

arXiv.org Machine Learning

2004.03639

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > China (0.04)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Stochastic Proximal Gradient Descent with Acceleration Techniques

Nitanda, Atsushi

Neural Information Processing SystemsFeb-14-2020, 08:11:55 GMT

Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. This method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction for the stochastic gradient. Accelerated proximal gradient descent (APG) and proximal stochastic variance reduction gradient (Prox-SVRG) are in a trade-off relationship. We show that our method, with the appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.

acceleration technique, prox-svrg, stochastic proximal gradient descent

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Breaking the Span Assumption Yields Fast Finite-Sum Minimization

Hannah, Robert, Liu, Yanli, O', Connor, Daniel, Yin, Wotao

Neural Information Processing SystemsDec-31-2018

In this paper, we show that SVRG and SARAH can be modified to be fundamentally faster than all of the other standard algorithms that minimize the sum of $n$ smooth functions, such as SAGA, SAG, SDCA, and SDCA without duality. Most finite sum algorithms follow what we call the ``span assumption'': Their updates are in the span of a sequence of component gradients chosen in a random IID fashion. In the big data regime, where the condition number $\kappa=O(n)$, the span assumption prevents algorithms from converging to an approximate solution of accuracy $\epsilon$ in less than $n\ln(1/\epsilon)$ iterations. SVRG and SARAH do not follow the span assumption since they are updated with a hybrid of full-gradient and component-gradient information. We show that because of this, they can be up to $\Omega(1+(\ln(n/\kappa))_+)$ times faster. In particular, to obtain an accuracy $\epsilon = 1/n^\alpha$ for $\kappa=n^\beta$ and $\alpha,\beta\in(0,1)$, modified SVRG requires $O(n)$ iterations, whereas algorithms that follow the span assumption require $O(n\ln(n))$ iterations. Moreover, we present lower bound results that show this speedup is optimal, and provide analysis to help explain why this speedup exists. With the understanding that the span assumption is a point of weakness of finite sum algorithms, future work may purposefully exploit this to yield faster algorithms in the big data regime.

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Breaking the Span Assumption Yields Fast Finite-Sum Minimization

Hannah, Robert, Liu, Yanli, O', Connor, Daniel, Yin, Wotao

Neural Information Processing SystemsDec-31-2018

In this paper, we show that SVRG and SARAH can be modified to be fundamentally faster than all of the other standard algorithms that minimize the sum of $n$ smooth functions, such as SAGA, SAG, SDCA, and SDCA without duality. Most finite sum algorithms follow what we call the ``span assumption'': Their updates are in the span of a sequence of component gradients chosen in a random IID fashion. In the big data regime, where the condition number $\kappa=O(n)$, the span assumption prevents algorithms from converging to an approximate solution of accuracy $\epsilon$ in less than $n\ln(1/\epsilon)$ iterations. SVRG and SARAH do not follow the span assumption since they are updated with a hybrid of full-gradient and component-gradient information. We show that because of this, they can be up to $\Omega(1+(\ln(n/\kappa))_+)$ times faster. In particular, to obtain an accuracy $\epsilon = 1/n^\alpha$ for $\kappa=n^\beta$ and $\alpha,\beta\in(0,1)$, modified SVRG requires $O(n)$ iterations, whereas algorithms that follow the span assumption require $O(n\ln(n))$ iterations. Moreover, we present lower bound results that show this speedup is optimal, and provide analysis to help explain why this speedup exists. With the understanding that the span assumption is a point of weakness of finite sum algorithms, future work may purposefully exploit this to yield faster algorithms in the big data regime.

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.46)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

A New Analysis of Variance Reduced Stochastic Proximal Methods for Composite Optimization with Serial and Asynchronous Realizations

Yu, Yue, Huang, Longbo

arXiv.org Machine LearningMay-25-2018

We provide a comprehensive analysis of stochastic variance reduced gradient (SVRG) based proximal algorithms, both with and without momentum, in serial and asynchronous realizations. Specifically, we propose the Prox-SVRG$^{++}$ algorithm, and prove that it has a linear convergence rate with a small epoch length and we obtain an $O(1/\epsilon)$ complexity in non-strongly convex case. Then, we propose a momentum accelerated algorithm, called Prox-MSVRG$^{++}$, and show that it achieves a complexity of $O(1/\sqrt{\epsilon})$. After that, we develop two asynchronous versions based on the above serial algorithms and provide a general analysis under nonconvex and non-strongly convex cases respectively. Our theoretical results indicate that the algorithms can achieve a potentially significant speedup when implemented with multiple servers. We conduct extensive experiments based on $4$ real-world datasets on an experiment platform with $11$ physical machines. The experiments validate our theoretical findings and demonstrate the effectiveness of our algorithms.

algorithm, artificial intelligence, machine learning, (13 more...)

arXiv.org Machine Learning

1805.10111

Genre: Research Report > New Finding (1.00)

Technology: