Goto

Collaborating Authors

 reparameterization


Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

arXiv.org Machine Learning

We study adversarial online learning with hidden-convex losses, i.e., nonconvex losses that become convex after a nonlinear reparameterization. Ghai, Lu and Hazan (2022) proved that, under geometric and smoothness assumptions, online gradient descent (OGD) on such nonconvex losses approximately simulates online mirror descent (OMD) on the underlying convex losses with a suitable regularizer, yielding $\mathcal{O}(T^{2/3})$ regret. They left open whether the optimal $Θ(\sqrt{T})$ regret from online convex optimization can be recovered in this hidden-convex setting. We answer this question affirmatively. More specifically, via a sharper discrete-time algorithmic equivalence argument, we prove that OGD achieves $\mathcal{O}(\sqrt{T})$ regret under the same assumptions, matching the optimal worst-case rate for adversarial online convex optimization. We also address another open question of Ghai, Lu and Hazan (2022) by clarifying the geometry required for this algorithmic equivalence. We replace the diagonal-Jacobian sufficient condition with a necessary-and-sufficient Hessian compatibility condition, thereby expanding the class of admissible reparameterizations. We complement our tight regret bound with a lower bound showing that the Hessian compatibility assumption is essential for OGD; when it fails, we construct a smooth reparameterization and an adversarial sequence of hidden-convex losses for which OGD suffers $Ω(T)$ regret. Finally, we extend our analysis to one-point bandit feedback and prove a $\mathcal{O}(T^{3/4})$ expected regret bound for bandit OGD with spherical smoothing, matching its classical rate on convex losses.


Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

arXiv.org Machine Learning

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.


Inference by Reparameterization in Neural Population Codes

Neural Information Processing Systems

Behavioral experiments on humans and animals suggest that the brain performs probabilistic inference to interpret its environment. Here we present a new generalpurpose, biologically-plausible neural implementation of approximate inference. The neural network represents uncertainty using Probabilistic Population Codes (PPCs), which are distributed neural representations that naturally encode probability distributions, and support marginalization and evidence integration in a biologically-plausible manner. By connecting multiple PPCs together as a probabilistic graphical model, we represent multivariate probability distributions. Approximate inference in graphical models can be accomplished by message-passing algorithms that disseminate local information throughout the graph. An attractive and often accurate example of such an algorithm is Loopy Belief Propagation (LBP), which uses local marginalization and evidence integration operations to perform approximate inference efficiently even for complex models.


Appendix

Neural Information Processing Systems

We first introduce some handy concepts and results to make the proof succinct, meanwhile providing more information for understanding our model and theory. We begin with some extended discussions on CSG. Note that a reparameterization unnecessarily has its output dimensions in S, i.e. The condition that p(y|s) = p0(y|ΦS(s,v)) for any v V does not indicate that ΦS(s,v) is constant of v, since p0(y|s0) may ignore the change of s0 = ΦS(s,v) from the change of v. The following lemma shows the meaning of a reparameterization: it allows a CSG to vary while inducing the same distribution on the observed data variables (x,y) (i.e., holding the same effect on describing data). We can now define and verify an equivalent relation on CSGs so that the resulting equivalent class contains CSGs that induce the same (x,y) data distribution and hold the same semantic information in their svariables. We say two CSGs pand p0 are semantic-equivalent, if there exists a homeomorphism11 Φ on S V, such that (i) is semantic-preserving: its output dimensions in S is constant of v, ΦS(s,v) = ΦS(s) for any v V, and (ii) it acts as a reparameterization from p to p0: Φ#[ps,v] = p0s,v, p(x|s,v) = p0(x|Φ(s,v)) and p(y|s) = p0(y|ΦS(s)). A.1 below shows that the defined binary relation is indeed an equivalence relation in common cases. As a reparameterization, Φ allows the two models to have different latent-variable parameterizations while inducing the same distribution on the observed data variables (x,y) (Lemma 9). This definition of semantic-equivalence can be rephrased as the existence of a semantic-preserving reparameterization. With proper model assumptions, we can show that any reparameterization between two CSGs is semantic-preserving, so that semantic-preserving CSGs cannot be converted to each other by a reparameterization that mixes swith v. Lemma 11. For two CSGs pand p0, if p0(y|s) has a statistics M0(s) that is an injective function of s, then any reparameterization Φ from pto p0, if exists, has its ΦS constant of v. Proof. Then the condition that p(y|s) = p0(y|ΦS(s,v)) for any v V indicates that M(s) = M0(ΦS(s,v)). If there exist s S and v(1) 6= v(2) V such that ΦS(s,v(1)) 6= ΦS(s,v(2)), then M0(ΦS(s,v(1))) 6= M0(ΦS(s,v(2))) 11A transformation is a homeomorphism if it is a continuous bijection with continuous inverse. This violates M(s) = M0(ΦS(s,v)) which requires both M0(ΦS(s,v(1))) and M0(ΦS(s,v(2))) to be equal to M(s). We then introduce two mathematical facts. Let z be a random variable on a Euclidean space RdZ with density function pz(z), and let Φ be a homeomorphism on RdZ whose inverse Φ 1 is differentiable.


Mirror Descent on Riemannian Manifolds

arXiv.org Machine Learning

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.


Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Neural Information Processing Systems

By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.


Inference by Reparameterization in Neural Population Codes

Neural Information Processing Systems

Behavioral experiments on humans and animals suggest that the brain performs probabilistic inference to interpret its environment. Here we present a new general-purpose, biologically-plausible neural implementation of approximate inference. The neural network represents uncertainty using Probabilistic Population Codes (PPCs), which are distributed neural representations that naturally encode probability distributions, and support marginalization and evidence integration in a biologically-plausible manner. By connecting multiple PPCs together as a probabilistic graphical model, we represent multivariate probability distributions. Approximate inference in graphical models can be accomplished by message-passing algorithms that disseminate local information throughout the graph. An attractive and often accurate example of such an algorithm is Loopy Belief Propagation (LBP), which uses local marginalization and evidence integration operations to perform approximate inference efficiently even for complex models.



A Derivations of Variance Controlled Diffusion

Neural Information Processing Systems

A.1 Proof of Proposition 4.1 Proposition 4.1 For any bounded measurable function τ(t): [0, T ] R, the following Reverse SDEs [ (1 + τ Eq. (20) is a reverse-time SDE running[ from T to 0, thus (there)are two additional minus ] signs in Eq. (21) before term A.2 Two Reparameterizations and Exact Solution under Exponential Integrator In this subsection, we will show the exact solution of SDE in both data prediction reparameterization and noise prediction reparameterization. The noise term in data prediction has smaller variance than noise prediction ones, implying the necessity of adopting data prediction reparameterization for the SDE sampler. The computation of variance uses the Itô Isometry, which is a crucial fact of Itô integral. Similar with Proposition 4.2, Eq. (37) can be solved analytically, which is shown in the following propositon: Following the derivation in Proposition 4.2, the mean of the Itô integral term is: [ A.2.4 Comparison between Data and Noise Reparameterizations In Table 1 we perform an ablation study on data and noise reparameterizations, the experiment results show that under the same magnitude of stochasticity, the proposed SA-Solver in data reparameterization has a better convergence which leads to better FID results under the same NFEs. In this subsection, we provide a theoretical view of this phenomenon.