05e2a0647e260c355dd2b2175edb45b8-Supplemental.pdf

Neural Information Processing Systems

One classical example is that the Riemannian manifold $E^m = (\mathbb{R}^m, \langle \cdot, \cdot \rangle_{\mathbb{R}^m})$ is nothing but the $m$-dimensional Euclidean space. Assume that the heat kernel is Lipschitz continuous. We first construct a sequence of probability measures $\{\rho_t\}_{t \in \mathbb{N}}$ such that $\rho_{2t} = \mu_t^{\boldsymbol{\phi}}$ and $\rho_{2t+1} = \mu_t^{M}$ for $t \in \mathbb{N}$. Proof: For a given manifold $M$, its Riemannian volume element does not vary with the time $t$; thus the lower bound of the integral ...
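For concreteness, on the Euclidean example $E^m$ above the heat kernel has the classical Gaussian closed form (a standard fact recalled here for illustration, not taken from this fragment):

\[
  p_t(x, y) \;=\; (4 \pi t)^{-m/2} \exp\!\left( - \frac{\lVert x - y \rVert_2^2}{4t} \right),
  \qquad x, y \in \mathbb{R}^m, \; t > 0,
\]

and for each fixed $t > 0$ this kernel is globally Lipschitz in $x$, consistent with the Lipschitz assumption made above.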


Faster Algorithms for User-Level Private Stochastic Convex Optimization

Neural Information Processing Systems

We study private stochastic convex optimization (SCO) under user-level differential privacy (DP) constraints. In this setting, there are $n$ users (e.g., cell phones), each possessing $m$ data items (e.g., text messages), and we need to protect the privacy of each user's entire collection of data items. Existing algorithms for user-level DP SCO are impractical in many large-scale machine learning scenarios because: (i) they make restrictive assumptions on the smoothness parameter of the loss function and require the number of users to grow polynomially with the dimension of the parameter space; or (ii) they are prohibitively slow, requiring at least $(mn)^{3/2}$ gradient computations for smooth losses and $(mn)^3$ computations for non-smooth losses. To address these limitations, we provide novel user-level DP algorithms with state-of-the-art excess risk and runtime guarantees, without stringent assumptions. First, we develop a linear-time algorithm with state-of-the-art excess risk (for a non-trivial linear-time algorithm) under a mild smoothness assumption. Our second algorithm applies to arbitrary smooth losses and achieves optimal excess risk in $\approx (mn)^{9/8}$ gradient computations. Third, for non-smooth loss functions, we obtain optimal excess risk in $n^{11/8} m^{5/4}$ gradient computations. Moreover, our algorithms do not require the number of users to grow polynomially with the dimension.
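As a concrete illustration of the user-level constraint (this is a generic sketch, not any of the paper's algorithms), the step below averages each user's per-item gradients, clips the per-user average so that any single user's entire collection of $m$ items has bounded influence, and then adds Gaussian noise; grad, C, sigma, and eta are assumed placeholders.

import numpy as np

def user_level_dp_sgd_step(theta, users, grad, C=1.0, sigma=1.0, eta=0.1,
                           rng=np.random.default_rng(0)):
    # users: an iterable of per-user datasets, each a list of m items.
    # Average each user's per-item gradients, then clip the *per-user*
    # average so one user's entire collection has influence at most C.
    clipped = []
    for user_data in users:
        g = np.mean([grad(theta, item) for item in user_data], axis=0)
        g = g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
        clipped.append(g)
    # Gaussian noise scaled to the per-user sensitivity C / n of the mean.
    noise = (sigma * C / len(users)) * rng.standard_normal(theta.shape)
    return theta - eta * (np.mean(clipped, axis=0) + noise)

Clipping at the user level, rather than per item, is what distinguishes this setting from item-level DP-SGD: one user contributes a single bounded vector per step regardless of m.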


Can SGD Learn Recurrent Neural Networks with Provable Generalization?

Neural Information Processing Systems

Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis. Yet, in the foundational PAC learning language, what concept class can they learn? Moreover, how can the same recurrent unit simultaneously learn functions from different input tokens to different output tokens, without affecting each other?
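To make the object of study concrete, here is a minimal Elman-style recurrent unit (an illustrative sketch, not the paper's construction): the same weights W, U, b, V are reused at every token position, which is exactly why learning different token-to-token functions without interference is non-trivial.

import numpy as np

def rnn_forward(tokens, W, U, b, V):
    # tokens: array of shape (T, d_in); returns outputs of shape (T, d_out).
    h = np.zeros(W.shape[0])
    outputs = []
    for x in tokens:                    # the same unit at every position
        h = np.tanh(W @ h + U @ x + b)  # shared recurrent weights W, U, b
        outputs.append(V @ h)           # shared per-token readout V
    return np.stack(outputs)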


The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Neural Information Processing Systems

Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD's final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations that SGD is run for) is known in advance, the behavior of SGD's final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of $\sqrt{T}$ in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only log factors (in the condition number for the strongly convex case, and in T in the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the step size scheme employed. These results demonstrate the subtlety in establishing optimal learning rate schedules (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.
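The contrast between the two schedule families is easy to state in code. Below is a small illustrative experiment on streaming least squares (the horizon T, decay factor 1/2, noise level, and base step size are arbitrary choices, not the paper's tuned constants): the polynomial schedule decays like $1/\sqrt{t}$, while the step-decay schedule halves the rate every $T/\log_2 T$ steps.

import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2 ** 12
theta_star = np.ones(d)

def sgd_final_iterate(lr_fn):
    # Streaming least squares: one fresh sample per step; return the
    # final iterate rather than an iterate average.
    theta = np.zeros(d)
    for t in range(1, T + 1):
        x = rng.standard_normal(d)
        y = x @ theta_star + 0.1 * rng.standard_normal()
        theta -= lr_fn(t) * (x @ theta - y) * x   # stochastic gradient step
    return theta

poly = lambda t: 0.1 / np.sqrt(t)                    # polynomial decay
step = lambda t: 0.1 * 0.5 ** (t * np.log2(T) // T)  # halve ~log2(T) times

for name, lr_fn in [("polynomial", poly), ("step decay", step)]:
    err = np.linalg.norm(sgd_final_iterate(lr_fn) - theta_star) ** 2
    print(f"{name}: final-iterate squared error {err:.5f}")

Note that the step-decay schedule requires knowing T in advance, which matches the paper's distinction between the fixed-horizon and anytime settings.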



A Riemannian Manifold. Definition 3 (Manifold) [36]: Let $M$ ...

Neural Information Processing Systems

Assume that the heat kernel is Lipschitz continuous. Proof: We start by introducing the following lemma, which is Proposition 4.4 in [20]. Following previous work, we first define the projection operator. We then introduce the following lemma, which utilizes the projection operator. Interested readers may also refer to Chapter 5.3 in [8]. Here $C(\epsilon) \to 0$ as $\epsilon \to 0$, and $\Gamma$ is the gamma function.
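The fragment does not show the projection operator itself; a standard choice in this literature is the metric (nearest-point) projection, recalled below as an assumed placeholder for the operator the proof uses:

\[
  \Pi_{M}(x) \;=\; \operatorname*{arg\,min}_{y \in M} \, \lVert x - y \rVert_2 ,
\]

which is well defined for every $x$ in a tubular neighborhood of a smooth, compact $M$.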


R1: Q1: It is hard to understand what Remark 6 conveys

Neural Information Processing Systems

R1: Q1: It is hard to understand what Remark 6 conveys. A: Yes, the error bound condition refers to the inequality in Lemma 1; Lemma 1 implies that the error bound condition holds.
R1: Q2: How can the bound affect or guide a specific choice of K in stagewise SGD? A: Theoretically, the choice of K is guided by the testing error bound; in practice, K is just a small number (see the sketch below).
R1: Q3: It might be better to use "START" in the paper title and Figure 1, instead of "SGD". A: We will make the change.
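For readers unfamiliar with the K being discussed: in stagewise SGD, K is the number of stages, each run at a constant step size that is shrunk between stages. The sketch below is an assumed generic form (grad, the stage length, and the halving factor are illustrative; this is not the authors' START algorithm).

import numpy as np

def stagewise_sgd(grad, theta0, K=5, steps_per_stage=1000, eta0=0.1,
                  rng=np.random.default_rng(0)):
    # Run K stages; within a stage the step size is constant, and it is
    # halved between stages (the halving factor is an illustrative choice).
    theta, eta = theta0.copy(), eta0
    for _ in range(K):                  # K is typically a small number
        for _ in range(steps_per_stage):
            theta = theta - eta * grad(theta, rng)
        eta *= 0.5
    return theta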