krzakala
All-or-nothingstatisticalandcomputationalphase transitionsinsparsespikedmatrixestimation
Similarly the ISOMAP face database consists ofimages (256levels ofgray)ofsize64 64,i.e.,vectors in R4096, whereas the correct intrinsic dimension is only3 (for the vertical, horizontal pause and lightingdirection). The second approach, is anaverage caseapproach (in the spirit of thestatistical mechanics treatment ofhighdimensional systems), thatmodelsfeaturevectorsby arandom ensemble,taken as aset ofrandom vectors with independently identically distributed (i.i.d.) components, and a small but xed fraction of non-zero components.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom (0.04)
- Asia > Middle East > Jordan (0.04)
- (6 more...)
All-or-nothingstatisticalandcomputationalphase transitionsinsparsespikedmatrixestimation
Similarly the ISOMAP face database consists ofimages (256levels ofgray)ofsize64 64,i.e.,vectors in R4096, whereas the correct intrinsic dimension is only3 (for the vertical, horizontal pause and lightingdirection). The second approach, is anaverage caseapproach (in the spirit of thestatistical mechanics treatment ofhighdimensional systems), thatmodelsfeaturevectorsby arandom ensemble,taken as aset ofrandom vectors with independently identically distributed (i.i.d.) components, and a small but xed fraction of non-zero components.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- (5 more...)
Optimal scaling laws in learning hierarchical multi-index models
Defilippis, Leonardo, Krzakala, Florent, Loureiro, Bruno, Maillard, Antoine
In this work, we provide a sharp theory of scaling laws for two-layer neural networks trained on a class of hierarchical multi-index targets, in a genuinely representation-limited regime. We derive exact information-theoretic scaling laws for subspace recovery and prediction error, revealing how the hierarchical features of the target are sequentially learned through a cascade of phase transitions. We further show that these optimal rates are achieved by a simple, target-agnostic spectral estimator, which can be interpreted as the small learning-rate limit of gradient descent on the first-layer weights. Once an adapted representation is identified, the readout can be learned statistically optimally, using an efficient procedure. As a consequence, we provide a unified and rigorous explanation of scaling laws, plateau phenomena, and spectral structure in shallow neural networks trained on such hierarchical targets.
- North America > United States (0.14)
- Europe > France (0.14)
- Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
- (2 more...)
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- (8 more...)
On the phase diagram of extensive-rank symmetric matrix denoising beyond rotational invariance
Barbier, Jean, Camilli, Francesco, Ko, Justin, Okajima, Koki
Matrix denoising is central to signal processing and machine learning. Its analysis when the matrix to infer has a factorised structure with a rank growing proportionally to its dimension remains a challenge, except when it is rotationally invariant. In this case the information theoretic limits and a Bayes-optimal denoising algorithm, called rotational invariant estimator [1,2], are known. Beyond this setting few results can be found. The reason is that the model is not a usual spin system because of the growing rank dimension, nor a matrix model due to the lack of rotation symmetry, but rather a hybrid between the two. In this paper we make progress towards the understanding of Bayesian matrix denoising when the hidden signal is a factored matrix $XX^\intercal$ that is not rotationally invariant. Monte Carlo simulations suggest the existence of a denoising-factorisation transition separating a phase where denoising using the rotational invariant estimator remains Bayes-optimal due to universality properties of the same nature as in random matrix theory, from one where universality breaks down and better denoising is possible by exploiting the signal's prior and factorised structure, though algorithmically hard. We also argue that it is only beyond the transition that factorisation, i.e., estimating $X$ itself, becomes possible up to sign and permutation ambiguities. On the theoretical side, we combine mean-field techniques in an interpretable multiscale fashion in order to access the minimum mean-square error and mutual information. Interestingly, our alternative method yields equations which can be reproduced using the replica approach of [3]. Using numerical insights, we then delimit the portion of the phase diagram where this mean-field theory is reliable, and correct it using universality when it is not. Our ansatz matches well the numerics when accounting for finite size effects.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (7 more...)
Information limits and Thouless-Anderson-Palmer equations for spiked matrix models with structured noise
Barbier, Jean, Camilli, Francesco, Mondelli, Marco, Xu, Yizhou
We consider a prototypical problem of Bayesian inference for a structured spiked model: a low-rank signal is corrupted by additive noise. While both information-theoretic and algorithmic limits are well understood when the noise is a Gaussian Wigner matrix, the more realistic case of structured noise still proves to be challenging. To capture the structure while maintaining mathematical tractability, a line of work has focused on rotationally invariant noise. However, existing studies either provide sub-optimal algorithms or are limited to special cases of noise ensembles. In this paper, using tools from statistical physics (replica method) and random matrix theory (generalized spherical integrals) we establish the first characterization of the information-theoretic limits for a noise matrix drawn from a general trace ensemble. Remarkably, our analysis unveils the asymptotic equivalence between the rotationally invariant model and a surrogate Gaussian one. Finally, we show how to saturate the predicted statistical limits using an efficient algorithm inspired by the theory of adaptive Thouless-Anderson-Palmer (TAP) equations.
- Europe > Austria (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.04)
- (2 more...)
Optimal thresholds and algorithms for a model of multi-modal learning in high dimensions
Keup, Christian, Zdeborová, Lenka
This work explores multi-modal inference in a high-dimensional simplified model, analytically quantifying the performance gain of multi-modal inference over that of analyzing modalities in isolation. We present the Bayes-optimal performance and weak recovery thresholds in a model where the objective is to recover the latent structures from two noisy data matrices with correlated spikes. The paper derives the approximate message passing (AMP) algorithm for this model and characterizes its performance in the high-dimensional limit via the associated state evolution. The analysis holds for a broad range of priors and noise channels, which can differ across modalities. The linearization of AMP is compared numerically to the widely used partial least squares (PLS) and canonical correlation analysis (CCA) methods, which are both observed to suffer from a sub-optimal recovery threshold.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
Rigorous dynamical mean field theory for stochastic gradient descent methods
Gerbelot, Cedric, Troiani, Emanuele, Mignacco, Francesca, Krzakala, Florent, Zdeborova, Lenka
We prove closed-form equations for the exact high-dimensional asymptotics of a family of first order gradient-based methods, learning an estimator (e.g. M-estimator, shallow neural network, ...) from observations on Gaussian data with empirical risk minimization. This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration. The obtained equations match those resulting from the discretization of dynamical mean-field theory (DMFT) equations from statistical physics when applied to gradient flow. Our proof method allows us to give an explicit description of how memory kernels build up in the effective dynamics, and to include non-separable update functions, allowing datasets with non-identity covariance matrices. Finally, we provide numerical implementations of the equations for SGD with generic extensive batch-size and with constant learning rates.
- Europe > Switzerland > Vaud > Lausanne (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (4 more...)
On double-descent in uncertainty quantification in overparametrized models
Clarté, Lucas, Loureiro, Bruno, Krzakala, Florent, Zdeborová, Lenka
Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.
- Europe > Switzerland > Vaud > Lausanne (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Gradient flow on extensive-rank positive semi-definite matrix denoising
Bodin, Antoine, Macris, Nicolas
In this work, we present a new approach to analyze the gradient flow for a positive semi-definite matrix denoising problem in an extensive-rank and high-dimensional regime. We use recent linear pencil techniques of random matrix theory to derive fixed point equations which track the complete time evolution of the matrix-mean-square-error of the problem. The predictions of the resulting fixed point equations are validated by numerical experiments. In this short note we briefly illustrate a few predictions of our formalism by way of examples, and in particular we uncover continuous phase transitions in the extensive-rank and high-dimensional regime, which connect to the classical phase transitions of the low-rank problem in the appropriate limit. The formalism has much wider applicability than shown in this communication.
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (2 more...)