Escaping Saddle-Point Faster under Interpolation-like Conditions
In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $O(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that its oracle complexity to reach an $\epsilon$-local-minimizer is $O(1/\epsilon^{2.5})$. While this complexity is better than that of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $O(1/\epsilon^{1.5})$ of the deterministic Cubic-Regularized Newton method.
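As an illustration of the first algorithm discussed above, here is a minimal sketch of perturbed gradient descent: plain gradient steps, plus a small uniform-ball perturbation injected whenever the gradient is small, which is what lets the method escape strict saddle points. This is a toy sketch, not the paper's exact algorithm; all names and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def psgd(grad_fn, x0, lr=0.01, noise_radius=0.1, grad_threshold=0.05,
         n_steps=1000, rng=None):
    """Perturbed gradient descent sketch: descend along (stochastic)
    gradients, and add a small uniform-ball perturbation whenever the
    gradient is small, to escape strict saddle points."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        g = grad_fn(x)
        if np.linalg.norm(g) < grad_threshold:
            # sample a perturbation uniformly from a ball of radius noise_radius
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)
            x = x + noise_radius * rng.random() ** (1 / x.size) * direction
        x = x - lr * g
    return x
```

On the saddle function $f(x) = x_0^2 - x_1^2$, starting exactly at the saddle point $(0, 0)$ where the gradient vanishes, the perturbation lets the iterates escape along the negative-curvature direction.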
Stochastic Optimization in Semi-Discrete Optimal Transport: Convergence Analysis and Minimax Rate
Genans, Ferdinand, Godichon-Baggioni, Antoine, Vialard, François-Xavier, Wintenberger, Olivier
We investigate the semi-discrete Optimal Transport (OT) problem, where a continuous source measure $μ$ is transported to a discrete target measure $ν$, with particular attention to the approximation of the OT map. In this setting, Stochastic Gradient Descent (SGD) based solvers have demonstrated strong empirical performance in recent machine learning applications, yet their theoretical guarantees for approximating the OT map remain an open question. In this work, we answer it positively by providing both computational and statistical convergence guarantees for SGD. Specifically, we show that SGD methods can estimate the OT map with a minimax convergence rate of $\mathcal{O}(1/\sqrt{n})$, where $n$ is the number of samples drawn from $μ$. To establish this result, we study the averaged projected SGD algorithm, and identify a suitable projection set that contains a minimizer of the objective, even when the source measure is not compactly supported. Our analysis holds under mild assumptions on the source measure and applies to MTW cost functions, which include $\|\cdot\|^p$ for $p \in (1, \infty)$. We finally provide numerical evidence for our theoretical results.
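The averaged projected SGD approach mentioned above can be sketched on the semi-dual formulation of semi-discrete OT. The following toy implementation (squared-Euclidean cost, ball projection, Polyak-Ruppert averaging) is illustrative only, not the authors' exact method; the function name, step-size schedule, and projection radius are all assumptions.

```python
import numpy as np

def semidiscrete_ot_sgd(sample_source, targets, target_weights,
                        n_iters=20000, lr0=1.0, proj_radius=10.0, rng=None):
    """Averaged projected SGD on the semi-dual of semi-discrete OT with
    squared-Euclidean cost: maximize E_x[min_j(|x - y_j|^2 - v_j)] + <v, nu>
    over dual weights v, with projection onto a ball and iterate averaging."""
    rng = np.random.default_rng(rng)
    m = len(targets)
    v = np.zeros(m)
    v_avg = np.zeros(m)
    for t in range(1, n_iters + 1):
        x = sample_source(rng)                       # one sample from the source
        costs = np.sum((targets - x) ** 2, axis=1) - v
        j = int(np.argmin(costs))                    # x is transported to y_j
        grad = target_weights.copy()
        grad[j] -= 1.0                               # stochastic supergradient in v
        v = v + (lr0 / np.sqrt(t)) * grad            # ascent step
        norm = np.linalg.norm(v)
        if norm > proj_radius:
            v *= proj_radius / norm                  # project onto the ball
        v_avg += (v - v_avg) / t                     # Polyak-Ruppert averaging
    return v_avg
```

For a uniform source on $[0, 1]$ and two targets at $0.25$ and $0.75$ with masses $(0.6, 0.4)$, the optimal dual weights place the cell boundary $0.5 + v_0 - v_1$ at $0.6$, so each cell receives exactly its target mass.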
Monitoring State Transitions in Markovian Systems with Sampling Cost
Saurav, Kumar, Shroff, Ness B., Liang, Yingbin
We consider a node-monitor pair, where the node's state varies with time. The monitor needs to track the node's state at all times; however, there is a fixed cost for each state query. So the monitor may instead predict the state using time-series forecasting methods, including time-series foundation models (TSFMs), and query only when prediction uncertainty is high. Since query decisions influence prediction accuracy, determining when to query is nontrivial. A natural approach is a greedy policy that predicts when the expected prediction loss is below the query cost and queries otherwise. We analyze this policy in a Markovian setting, where the optimal (OPT) strategy is a state-dependent threshold policy minimizing the time-averaged sum of query cost and prediction losses. We show that, in general, the greedy policy is suboptimal and can have an unbounded competitive ratio, but under common conditions such as identically distributed transition probabilities, it performs close to OPT. For the case of unknown transition probabilities, we further propose a projected stochastic gradient descent (PSGD)-based learning variant of the greedy policy, which achieves a favorable predict-query tradeoff with improved computational efficiency compared to OPT.
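A minimal sketch of the greedy policy described above, assuming a known transition matrix and 0-1 prediction loss: the monitor propagates a belief over states, and queries exactly when the expected prediction loss (one minus the maximum belief mass) is at least the query cost. Function names and the loss choice are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def greedy_monitor(P, states, query_cost):
    """Greedy monitoring sketch: track a belief over the node's state,
    predict the most likely state, and query only when the expected
    0-1 prediction loss (1 - max belief) is at least the query cost.
    Returns the total incurred cost (query costs plus prediction losses)."""
    n = P.shape[0]
    belief = np.zeros(n)
    belief[states[0]] = 1.0                 # initial state is known
    total = 0.0
    for s in states[1:]:                    # true trajectory, unseen unless queried
        belief = belief @ P                 # propagate belief one step
        if 1.0 - belief.max() >= query_cost:
            total += query_cost             # query: pay the cost, learn the state
            belief = np.zeros(n)
            belief[s] = 1.0
        else:
            total += float(int(belief.argmax()) != s)  # incur 0-1 prediction loss
    return total
```

For a sticky two-state chain with stay-probability 0.9 and query cost 0.05, the expected loss one step after a query is 0.1, so this greedy policy queries at every step; with a query cost above 0.5 it never queries, since the expected 0-1 loss over two states cannot exceed 0.5.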
Position-based Scaled Gradient for Model Quantization and Pruning - Appendix
In this experiment, we only quantize the weights, not the activations, to compare the performance degradation as the weight bit-width decreases. The mean squared errors (MSE) of the weights across different bit-widths are also reported. In Fig. A1, we display the full-precision weight distributions of the PSGD models and compare them with their SGD-trained counterparts. Four random layers of each model are shown column-wise. The first row displays the model trained with SGD and L2 weight decay. This is also reported in Figure 1 of the original paper.
Figure 5: Loss surface using [35]; SGD (top) and PSGD (bottom)
We thank the reviewers for their positive and constructive feedback. Note that our PSGD model has accuracy similar to the SGD-trained model at FP. A similar rationale is given in Sec. Note that at lower bits such as W2A8, we attain 62.7% accuracy, while LAPQ attains 1.3% accuracy. The detailed definition and proof are in [38].