Goto

Collaborating Authors

 derivation


Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

arXiv.org Machine Learning

We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $ n^{1/(2β)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $β>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=Θ(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.


Extending Kernel Trick to Influence Functions

arXiv.org Machine Learning

In this paper, we present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.


Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

arXiv.org Machine Learning

Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input--output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding~$d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces~$p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.



Model Derivation We write the joint posterior as, 2|Y,Z/ (Y|X,, 2) (| 2) (2) (13) / (2) N/2exp(1 2 2 (Y Z)Tdiag(x(Z)) (Y Z)) (2) 1exp(1 2 2 T) (2) (1+

Neural Information Processing Systems

The intermediate steps can be found in [67]. Derivation of Posterior Predictive Note, this derivation takes the priors to be set as in BayesLIME or BayesSHAP, namely, with values close to zero. We apply the identity from equation 17 to derive this posterior. In these derivations, the perturbation matrices Z have elements Zij 2{ 0,1} where each Zij Bernoulli(0.5). Note, in these proofs, we take take the priors to be set as in BayesLIME and BayesSHAP, i.e., they have hyperparameter values close to 0. B.1 Proof of Theorem 3.3 Note that we use N to denote the total perturbations while S denotes the perturabtions collected so far.



Supplementary Material for Grammar-Based Grounded Lexicon Learning

Neural Information Processing Systems

In the supplementary material, we describe the domain specific languages used in our experiments (Section 1), demonstrate how the proposed CKY-E2 method works by a concrete example (Section 2.1), show formal properties of CKY-E2 (Section 2.2), present dataset setups and analyze model behaviors (Section 3), and list environmental details for experiments (Section??). In this section, we will present and discuss the domain-specific languages (DSLs) we use for two domains: visual reasoning and language-guided navigation. We will further introduce the neurosymbolic module we have designed for executing programs in these two domains. Overall, each DSL contains a set of types and a set of deterministic modules that have been manually designed for realizing necessary operations in these domains. However, in contrast to realizing them as we do in standard programming languages (with for-loops and if-conditions), we will be using tensor operations (e.g., tensor additions and multiplications) to realize them so that the output of each program is differentiable with respect to all of its inputs. We refer readers to the original papers for a detailed introduction to the DSL and neuro-symbolic program execution. Here we only highlight the key aspects of our language and its neuro-symbolic realization, and discuss the difference between our implementation and the ones in original papers. Our visual reasoning DSL is a subset of CLEVR, containing 6 types and 8 primitive operations. Table 1 illustrates all 6 types and how they are internally represented in neuro-symbolic execution. Table 2 further shows all operations in the DSL. There are two main differences between the DSL used by G2L2 and the original CLEVRDSL.



2e9f978b222a956ba6bdf427efbd9ab3-Supplemental.pdf

Neural Information Processing Systems

B.3 Derivations of Eq. (19) Similar to derivation above, we give the gradient with respect to weight vector w RM+, which is given by wDKL = w log Z(U,w) wEU,w (log pθ(X |z))T1N + wEU,w (log pθ(U |z))Tw . The learning rate of each stochastic gradient descent step is γt t 1, where t {1,,T}denotes the iteration for optimization. We already report the t-SNE visualization of ByPE-VAE and standard VAE in Figure. Here we give more t-SNE visualization results. First, we randomly sample from ByPE-VAEs trained on different datasets, namely, MNIST, Fashion MNIST, and Celeba, as shown in Fig.7.


derivation of Eqs . 3 and 5

Neural Information Processing Systems

A.1 Derivation of Eq. (3) By expanding Eq. (2) with the definition of εli,t = xli,t µli,t, we have: Et = We note that each xli,t influences Et in two ways: (i) it occurs in Eq. (6) explicitly, but (ii) it also determines the values of µl 1k,t via Eq. Considering also the special cases of l = Land l = 0, we obtain Eq. (3). We note that θl+1i,j affects the value of the function Et of Eq. (6) by influencing µli,t via Eq. Here, we provide further details about training PCNs, useful to reproduce them. Furthermore, we have applied a decay factor of 0.9 to γ, applied each time the energy failed to decrease.