On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry

Tinati, Mohammad, Tu, Stephen

arXiv.org Machine Learning

Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.
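As a rough sketch of the setup the abstract describes (in generic notation; the losses, the group G, and the sample sizes n, m are placeholders rather than the paper's exact definitions), two-stage M-estimation with a pre-training symmetry can be written as

\[
\hat{\theta}_n \in \operatorname*{arg\,min}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{pre}}(z_i; \theta),
\qquad
\hat{\beta}_m \in \operatorname*{arg\,min}_{\beta \in B} \frac{1}{m} \sum_{j=1}^{m} \ell_{\mathrm{ft}}\big(x_j, y_j; \hat{\theta}_n, \beta\big),
\]

where the pre-training loss satisfies \(\ell_{\mathrm{pre}}(z; g \cdot \theta) = \ell_{\mathrm{pre}}(z; \theta)\) for every \(g\) in a group \(G\) acting on \(\Theta\), so the first stage identifies \(\theta\) only up to its orbit \(\{ g \cdot \theta : g \in G \}\), which is the identifiability issue the paper handles with Riemannian-geometric tools.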


On the Convergence of Encoder-only Shallow Transformers

Neural Information Processing Systems

In addition, a neural tangent kernel (NTK) based analysis is given, which facilitates a comprehensive comparison. Our theory demonstrates the separation in importance between different scaling schemes and initializations.




A Concept uniqueness and granularity

Neural Information Processing Systems

Here, we report statistics about the uniqueness of neuron concepts as we increase the maximum formula length of our explanations. Figure S1: number of repeated concepts across probed vision and NLI models, by maximum formula length. Table S1: for probed Image Classification and NLI models, the average number of occurrences of each detected concept and the percentage of detected concepts that are unique (i.e., that occur exactly once). A.1 Image Classification: Figure S1 (left) plots the number of times each unique concept appears across the 512 units of ResNet-18 as the maximum formula length increases. Table S1 displays the mean number of occurrences per concept and the percentage of detected concepts that are unique.
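As an illustration of how such uniqueness statistics can be computed, a minimal sketch in Python (hypothetical input; it assumes the per-unit explanations are already available as one concept string per probed unit, e.g. 512 entries for ResNet-18):

    from collections import Counter

    # Hypothetical input: one detected concept (formula string) per probed unit.
    concepts = ["water", "water", "sky OR cloud", "grass", "grass", "grass", "dog"]

    counts = Counter(concepts)

    # Concepts detected for more than one unit ("repeated" concepts, as in Figure S1).
    num_repeated = sum(1 for c in counts.values() if c > 1)

    # Average number of occurrences per detected concept (as in Table S1).
    mean_occurrences = sum(counts.values()) / len(counts)

    # Percentage of detected concepts that are unique, i.e. occur exactly once.
    pct_unique = 100.0 * sum(1 for c in counts.values() if c == 1) / len(counts)

    print(num_repeated, mean_occurrences, pct_unique)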



492114f6915a69aa3dd005aa4233ef51-Supplemental.pdf

Neural Information Processing Systems

A deterministic path uses self-attention and cross-attention to summarize contexts. B.1 1D Regression Architectures: for models without attention (CNP, NP, BNP), we set ℓ_pre = 4, ℓ_post = 2, ℓ_dec = 3, and d_h = 128. For NP we set d_z = 128. For Student-t noise, we added ε ~ γ · T(2.1) to the curves generated from a GP with RBF kernel, where T(2.1) is a Student's t distribution with 2.1 degrees of freedom and γ ~ Unif(0, 0.15). After realizing them, the functions drawn from the prior are optimized via Bayesian optimization.
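A minimal sketch of the heavy-tailed noise scheme described above, assuming a standard RBF-kernel GP prior (the lengthscale, output scale, and grid are illustrative choices, not the paper's settings):

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample one curve from a GP prior with an RBF kernel.
    x = np.linspace(-2.0, 2.0, 100)
    lengthscale, outputscale = 0.5, 1.0  # illustrative hyperparameters
    K = outputscale * np.exp(-0.5 * ((x[:, None] - x[None, :]) / lengthscale) ** 2)
    y = rng.multivariate_normal(np.zeros_like(x), K + 1e-6 * np.eye(len(x)))

    # Heavy-tailed corruption: eps ~ gamma * T(2.1), with gamma ~ Unif(0, 0.15).
    gamma = rng.uniform(0.0, 0.15)
    eps = gamma * rng.standard_t(df=2.1, size=len(x))
    y_noisy = y + eps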




Fine Tuning a Simulation-Driven Estimator

Lakshminarayanan, Braghadeesh, Guerrero, Margarita A., Rojas, Cristian R.

arXiv.org Machine Learning

Many industries now deploy high-fidelity simulators (digital twins) to represent physical systems, yet their parameters must be calibrated to match the true system. This motivated the construction of simulation-driven parameter estimators, built by generating synthetic observations for sampled parameter values and learning a supervised mapping from observations to parameters. However, when the true parameters lie outside the sampled range, predictions suffer from an out-of-distribution (OOD) error. This paper introduces a fine-tuning approach for the Two-Stage estimator that mitigates OOD effects and improves accuracy. The effectiveness of the proposed method is verified through numerical simulations.
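To illustrate the general simulation-driven recipe the abstract describes (not the authors' exact Two-Stage estimator or fine-tuning rule), a hedged sketch: sample parameters, simulate observations, fit a supervised map from observations to parameters, then continue training on parameters re-sampled around a preliminary estimate so the estimator adapts when the true parameter lies outside the original sampling range. The simulator, ranges, and network sizes below are placeholders.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)

    def simulate(theta, n=50):
        # Placeholder simulator (digital twin): noisy exponential decay with rate theta.
        t = np.linspace(0.0, 1.0, n)
        return np.exp(-theta * t) + 0.01 * rng.standard_normal(n)

    # Stage 1: learn a supervised mapping observations -> parameter on an assumed range.
    thetas = rng.uniform(0.5, 2.0, size=1000)
    X = np.stack([simulate(th) for th in thetas])
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, warm_start=True)
    model.fit(X, thetas)

    # Observed data whose true parameter may lie outside [0.5, 2.0] (OOD).
    y_obs = simulate(theta=2.5)
    theta_hat = float(model.predict(y_obs[None, :])[0])

    # Fine-tuning: re-simulate in a neighbourhood of the preliminary estimate and
    # continue training (warm_start=True resumes from the current weights).
    thetas_ft = rng.uniform(theta_hat - 0.5, theta_hat + 0.5, size=200)
    X_ft = np.stack([simulate(th) for th in thetas_ft])
    model.fit(X_ft, thetas_ft)

    theta_refined = float(model.predict(y_obs[None, :])[0])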