2639ba2137371773aa1e64e7735cdb30-Supplemental.pdf

Neural Information Processing Systems

Supplementary information for "Limiting fluctuation and trajectorial stability of multilayer neural networks with mean field training": Appendix A introduces several preliminaries, namely new notations and known results from [17]. Appendix B studies the Gaussian component G and proves Theorem 2. Appendix C presents the proof of Theorem 3 on the well-posedness of R. Appendix D proves Theorem 5, which connects the neural network with the second-order MF limit at the fluctuation level. Appendix E proves Theorem 6, which establishes a central limit theorem for the output fluctuation.


Limiting fluctuation and trajectorial stability of multilayer neural networks with mean field training

Pham, Huy Tuan, Nguyen, Phan-Minh

arXiv.org Machine Learning

The mean field (MF) theory of multilayer neural networks centers around a particular infinite-width scaling, in which the learning dynamics is closely tracked by the MF limit. A random fluctuation around this infinite-width limit is expected from a large-width expansion to the next order. This fluctuation has so far been studied only in shallow networks, where previous works employ heavily technical notions or additional formulation ideas amenable only to that case. A treatment of the multilayer case has been missing, the chief difficulty being to find a formulation that captures the stochastic dependency across not only time but also depth. In this work, we initiate the study of the fluctuation for multilayer networks, at any network depth. Leveraging the neuronal embedding framework recently introduced by Nguyen and Pham, we systematically derive a system of dynamical equations, called the second-order MF limit, that captures the limiting fluctuation distribution. Through this framework we demonstrate the complex interaction among neurons in the second-order MF limit, the stochasticity with cross-layer dependency, and the nonlinear time evolution inherent in the limiting fluctuation. A limit theorem is proven that quantitatively relates this limit to the fluctuation of large-width networks. We apply the result to show a stability property of gradient descent MF training: in the large-width regime, along the training trajectory, training progressively biases towards a solution with "minimal fluctuation" (in fact, vanishing fluctuation) in the learned output function, even after the network has been initialized at, or has converged (sufficiently fast) to, a global optimum. This extends a similar phenomenon, previously shown only for shallow networks with a squared loss in the ERM setting, to multilayer networks in a more general setting with a loss function that is not necessarily convex.
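
As a schematic illustration (not taken from the paper, and with $\hat{y}_n$, $\bar{y}$, $\tilde{y}$ as illustrative notation rather than the paper's), fluctuation results of this type are usually phrased as a next-order correction to the MF limit: for a generic width parameter $n$, the network output admits an expansion of the form
$$ \hat{y}_n(x; t) \;\approx\; \bar{y}(x; t) \;+\; \frac{1}{\sqrt{n}}\, \tilde{y}(x; t), $$
where $\bar{y}$ is the (first-order) MF limit and $\tilde{y}$ is the limiting fluctuation whose law a second-order limit is meant to capture; the precise scaling and assumptions for the multilayer case are those of the paper, not of this sketch.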


Embeddings of Persistence Diagrams into Hilbert Spaces

Bubenik, Peter, Wagner, Alexander

arXiv.org Machine Learning

Since persistence diagrams do not admit an inner product structure, a map into a Hilbert space is needed in order to use kernel methods. It is natural to ask if such maps necessarily distort the metric on persistence diagrams. We show that persistence diagrams with the bottleneck distance do not even admit a coarse embedding into a Hilbert space. As part of our proof, we show that any separable, bounded metric space isometrically embeds into the space of persistence diagrams with the bottleneck distance. As corollaries, we obtain the generalized roundness, negative type, and asymptotic dimension of this space.
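
For reference (standard definitions, not specific to this paper): the bottleneck distance between two persistence diagrams $D_1$ and $D_2$ is
$$ d_B(D_1, D_2) \;=\; \inf_{\gamma}\, \sup_{x \in D_1} \lVert x - \gamma(x) \rVert_\infty, $$
where $\gamma$ ranges over matchings between $D_1$ and $D_2$ in which points may also be matched to the diagonal. A map $f : X \to H$ into a Hilbert space is a coarse embedding if there exist non-decreasing functions $\rho_-, \rho_+ : [0,\infty) \to [0,\infty)$ with $\rho_-(t) \to \infty$ such that $\rho_-(d_X(x, y)) \le \lVert f(x) - f(y) \rVert_H \le \rho_+(d_X(x, y))$ for all $x, y \in X$.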


Boosted Density Estimation Remastered

Cranko, Zac, Nock, Richard

arXiv.org Machine Learning

There has recently been a steady increase in iterative approaches to boosted density estimation and sampling, usually proceeding by adding candidate "iterate" densities to a model that becomes more accurate with each iteration. The accompanying burst of formal convergence results has not yet changed a striking picture: all results essentially pay the price of heavy assumptions on the iterates, often unrealistic or hard to check, in blatant contrast with the original boosting theory, where such assumptions would be the weakest possible. In this paper, we show that all that suffices to achieve boosting for \textit{density estimation} is a \emph{weak learner} in the original boosting theory sense, that is, an oracle that supplies \textit{classifiers}. We provide convergence rates that comply with boosting requirements and that are better and/or rely on substantially weaker assumptions than the state of the art. One of our rates is, to our knowledge, the first to rely on assumptions that are not just weak but also \textit{empirically testable}. We show that the fitted model belongs to an exponential family, and in the course of our results we obtain a variational characterization of $f$-divergences better than $f$-GAN's. Experiments on several simulated problems display significantly better results than AdaGAN during early boosting rounds, in particular for mode capture, while using architectures less than a fifth of AdaGAN's size.
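
For context (this is the standard $f$-GAN bound that the abstract compares against, not the paper's improved characterization): for a convex $f$ with convex conjugate $f^*$, the $f$-divergence admits the variational lower bound
$$ D_f(P \,\|\, Q) \;\ge\; \sup_{T \in \mathcal{T}} \Big( \mathbb{E}_{x \sim P}[T(x)] \;-\; \mathbb{E}_{x \sim Q}[f^*(T(x))] \Big), $$
where the supremum is taken over a class $\mathcal{T}$ of real-valued functions (in $f$-GAN, functions representable by a discriminator network); the bound is tight when $\mathcal{T}$ contains $f'(dP/dQ)$.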