- Asia > China > Hubei Province > Wuhan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
- Health & Medicine > Therapeutic Area > Immunology (0.46)
Appendix for " Fine-Grained Theoretical Analysis of Federated Zeroth-Order Optimization "
The main notations of this paper are summarized in Table 1.

Table 1: Descriptions of the main notations used in this work.
- N, n: the total number of clients and the number of samples per client
- S, S: …

We first introduce the lemmas that will be used in our proofs. Let $e$ be the base of the natural logarithm. The stated result in Part (b) is proved. The optimization bound is given.
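The surviving fragments fix the setting: $N$ clients, $n$ samples per client, and zeroth-order (gradient-free) updates. As a purely illustrative sketch of that setting, not the paper's algorithm or its analyzed update rule, the following JAX code runs a standard two-point zeroth-order gradient estimator inside a FedAvg-style round; all names and hyperparameters (zo_gradient, fed_zo_round, mu, num_dirs, local_steps) are hypothetical.

```python
import jax
import jax.numpy as jnp

def zo_gradient(loss, w, key, mu=1e-3, num_dirs=10):
    # Two-point zeroth-order estimate: average of
    # (loss(w + mu*u) - loss(w - mu*u)) / (2*mu) * u over Gaussian directions u.
    us = jax.random.normal(key, (num_dirs, w.shape[0]))
    est = jax.vmap(lambda u: (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu) * u)(us)
    return jnp.mean(est, axis=0)

def fed_zo_round(w, client_losses, key, lr=0.05, local_steps=5):
    # One FedAvg-style round: each of the N clients takes a few local
    # zeroth-order steps from the shared model; the server averages the results.
    local_models = []
    for loss in client_losses:                      # N clients
        w_local = w
        for _ in range(local_steps):
            key, sub = jax.random.split(key)
            w_local = w_local - lr * zo_gradient(loss, w_local, sub)
        local_models.append(w_local)
    return jnp.mean(jnp.stack(local_models), axis=0), key

# Toy run: two clients with shifted quadratic objectives; the averaged
# model approaches the minimizer of the average objective (here w = 0).
client_losses = [lambda w: jnp.sum((w - 1.0) ** 2),
                 lambda w: jnp.sum((w + 1.0) ** 2)]
w, key = jnp.full(4, 3.0), jax.random.PRNGKey(0)
for _ in range(50):
    w, key = fed_zo_round(w, client_losses, key)
```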
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.45)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)
Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian
Yiran Zhang, Weihang Xu, Mo Zhou, Maryam Fazel, Simon Shaolei Du
Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero at a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.
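The setup in the abstract can be made concrete with a minimal sketch. The parameterization below, a uniform mixture of $n$ Gaussians with learnable means fit to a single ground-truth Gaussian, is one plausible reading of "a student model with $n$ learnable parameters" rather than the paper's exact model; the Monte-Carlo average over fresh samples stands in for the population expectation, and all names and constants are assumptions.

```python
import jax
import jax.numpy as jnp

def student_score(mus, x, sigma):
    # Score of a uniform mixture of Gaussians N(mu_i, sigma^2 I) at points x:
    # grad log p(x) = -sum_i w_i(x) (x - mu_i) / sigma^2, w_i = responsibilities.
    diffs = x[:, None, :] - mus[None, :, :]             # [batch, n, dim]
    logw = -jnp.sum(diffs ** 2, axis=-1) / (2 * sigma ** 2)
    w = jax.nn.softmax(logw, axis=1)                    # component responsibilities
    return -jnp.einsum('bn,bnd->bd', w, diffs) / sigma ** 2

def sm_loss(mus, x, mu_star, sigma):
    # Monte-Carlo version of E || s_theta(x) - grad log p*(x) ||^2 with
    # x ~ N(mu_star, sigma^2 I), whose true score is -(x - mu_star) / sigma^2.
    true_score = -(x - mu_star) / sigma ** 2
    err = student_score(mus, x, sigma) - true_score
    return jnp.mean(jnp.sum(err ** 2, axis=-1))

key = jax.random.PRNGKey(0)
dim, n, sigma = 2, 4, 1.0
mu_star = jnp.zeros(dim)
mus = 0.1 * jax.random.normal(key, (n, dim))            # small initialization
grad_fn = jax.jit(jax.grad(sm_loss))
for _ in range(500):
    key, sub = jax.random.split(key)
    x = mu_star + sigma * jax.random.normal(sub, (256, dim))
    mus = mus - 0.2 * grad_fn(mus, x, mu_star, sigma)   # plain gradient descent
```

In this toy run one can track sm_loss across iterations to watch the decay; none of the constants or rates here are taken from the paper.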
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Asia > China > Beijing > Beijing (0.04)
Translation-equivariant Representation in Recurrent Networks with a Continuous Manifold of Attractors: Supplementary Information
Wen-Hao Zhang
Based on the requirement of equivariant representation (Eq. …), and since the translation is continuous, the amount of translation can be made infinitesimally small. In Eq. (S3) we also define $\hat{p}$. Differentiating the above equation, we can derive a differential form of the translation operator,
$$\frac{d\hat{T}(a)}{da} = \hat{p}\exp(a\hat{p}) = \hat{p}\,\hat{T}(a). \tag{S7}$$
If the Gaussian ansatz is correct, then based on Eq. (8a) the solutions should satisfy $u(x - s) = \rho \ldots W \ldots$ We performed a perturbative analysis of the stability of the CAN dynamics. Substituting the above equation into the modified CAN dynamics (Eq. …), Eq. (S16) can be simplified as $\tau\,\partial_t u(x - s) + \tau \ldots$ Using Eq. (S15e), we can project Eq. (…). Similar to the analysis of the CAN, we propose the following Gaussian ansatz for the network's … For simplicity, we assume the speed neurons' responses … The projection computes the inner product between the network dynamics (Eq. …) …
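The missing step between the infinitesimal translation and Eq. (S7) is the standard Lie-group argument. A reconstructed derivation follows, assuming the exponential form $\hat{T}(a) = \exp(a\hat{p})$ that the fragment implies, with $\hat{p}$ the generator defined in Eq. (S3):

```latex
% Reconstructed reasoning for Eq. (S7), assuming \hat{T}(a) = \exp(a\hat{p}).
\begin{align*}
\hat{T}(a + \delta a) &= \hat{T}(\delta a)\,\hat{T}(a)
  \approx (1 + \delta a\,\hat{p})\,\hat{T}(a)
  &&\text{(infinitesimal translation, generator $\hat{p}$)}\\
\frac{d\hat{T}(a)}{da}
  &= \lim_{\delta a \to 0}\frac{\hat{T}(a + \delta a) - \hat{T}(a)}{\delta a}
  = \hat{p}\,\hat{T}(a)
  &&\text{(take the limit)}\\
\hat{T}(a) &= \exp(a\,\hat{p})
  \;\Longrightarrow\;
  \frac{d\hat{T}(a)}{da} = \hat{p}\exp(a\,\hat{p}) = \hat{p}\,\hat{T}(a).
  &&\text{(Eq.\ (S7))}
\end{align*}
```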
- North America > United States (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)