spectrum
Linearization Explains Fine-Tuning in Large Language Models
Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain underexplored. In this paper, we provide several insights into such fine-tuning through the lens of linearization. Fine-tuned models are often implicitly encouraged to remain close to the pretrained model. By making this explicit, using an ℓ2distance inductive bias in parameter space, we show that fine-tuning dynamics become equivalent to learning with the positive-definite neural tangent kernel (NTK). We specifically analyze how close the fully linear and the linearized finetuning optimizations are, based on the strength of the regularization. This allows us to be pragmatic about how good a model linearization is when fine-tuning large language models (LLMs). When linearization is a good model, our findings reveal a strong correlation between the eigenvalue spectrum of the NTK and the performance of model adaptation. Motivated by this, we give spectral perturbation bounds on the NTK induced by the choice of layers selected for fine-tuning.
Atomic Diffusion Models for Small Molecule Structure Elucidation from NMRSpectra
Nuclear Magnetic Resonance (NMR) spectroscopy is a cornerstone technique for determining the structures of small molecules and is especially critical in the discovery of novel natural products and clinical therapeutics. Yet, interpreting NMR spectra remains a time-consuming, manual process requiring extensive domain expertise. We introduce CHEFNMR (CHemical Elucidation From NMR), an endto-end framework that directly predicts an unknown molecule's structure solely from its 1DNMR spectra and chemical formula. We frame structure elucidation as conditional generation from an atomic diffusion model built on a non-equivariant transformer architecture. To model the complex chemical groups found in natural products, we generated a dataset of simulated 1DNMR spectra for over 111,000 natural products. CHEFNMR predicts the structures of challenging natural product compounds with an unsurpassed accuracy of over 65%. This work takes a significant step toward solving the grand challenge of automating small-molecule structure elucidation and highlights the potential of deep learning in accelerating molecular discovery.
Mitigating Spurious Features in Contrastive Learning with Spectral Regularization
Neural networks generally prefer simple and easy-to-learn features. When these features are spuriously correlated with the labels, the network's performance can suffer, particularly for underrepresented classes or concepts. Self-supervised representation learning methods, such as contrastive learning, are especially prone to this issue, often resulting in worse performance on downstream tasks. We identify a key spectral signature of this failure: early reliance on dominant singular modes of the learned feature matrix. To mitigate this, we propose a novel framework that promotes a uniform eigenspectrum of the feature covariance matrix, encouraging diverse and semantically rich representations. Our method operates in a fully self-supervised setting, without relying on ground-truth labels or any additional information. Empirical results on SimCLR and SimSiam demonstrate consistent gains in robustness and transfer performance, suggesting broad applicability across self-supervised learning paradigms.
Bridging Scales: Spectral Theory Reveals How Local Connectivity Rules Sculpt Global Neural Dynamics in Spatially Extended Networks
The brain's diverse spatiotemporal activity patterns are fundamental to cognition and consciousness, yet how these macroscopic dynamics emerge from microscopic neural circuitry remains a critical challenge. We take a step in this direction by developing a spatially extended neural network model integrated with a spectral theory of its connectivity matrix. Our theory quantitatively demonstrates how local structural parameters, such as E/I neuron projection ranges, connection strengths, and density determine distinct features of the eigenvalue spectrum, specifically outlier eigenvalues and a bulk disk. These spectral signatures, in turn, precisely predict the network's emergent global dynamical regime, encompassing asynchronous states, synchronous states, oscillations, localized activity bumps, traveling waves, and chaos. Motivated by observations of shifting cortical dynamics in mice across arousal states, our framework not only provides a possible explanation for repertoire of behaviors but also offers a principled starting point for inferring underlying effective connectivity changes from macroscopic brain activity. By mechanistically linking neural structure to dynamics, this work advances a principled framework for dissecting how large-scale activity patterns--central to cognition and open questions in consciousness research--arise from, and constrain, local circuitry.
Fourier Clouds: Fast Bias Correction for Imbalanced Semi-Supervised Learning
Pseudo-label-based Semi-Supervised Learning (SSL) often suffers from classifier bias, particularly under class imbalance, as inaccurate pseudo-labels tend to exacerbate existing biases towards majority classes. Existing methods, such as \textit{CDMAD}\cite{cdmad}, utilize simplistic reference inputs--typically uniform or blank-colored images--to estimate and correct this bias. However, such simplistic references fundamentally ignore realistic statistical information inherent to real datasets, specifically typical color distributions, texture details, and frequency characteristics. This lack of \emph{statistical representativeness} can lead the model to inaccurately estimate its inherent bias, limiting the effectiveness of bias correction, particularly under severe class imbalance or substantial distribution mismatches between labeled and unlabeled datasets. To overcome these limitations, we introduce the \textbf{FARAD} (Fourier-Adapted Reference for Accurate Debiasing) System.
MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MIXAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MIXAT across a wide spectrum of state-of-the-art attacks, proposing the *At Least One Attack Success Rate* (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MIXAT achieves substantially better robustness (ALO-ASR $ < 20\%$) compared to prior defenses (ALO-ASR $> 50\%$), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MIXAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MIXAT discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs.
Mamba Modulation: On the Length Generalization of Mamba Models
The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behavior of its state-space dynamics, particularly within the parameterization of the state transition matrix $A$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N{\Delta}_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $A$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $A$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating ${\Delta}_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
On the Optimizer Dependence of Neural Scaling Laws
Ramani, Vansh, Jain, Shourya Vir
The scaling exponent $α$ in neural scaling laws $L(N) \propto N^{-α}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $α$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $α$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $α$), with the $α$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $α\approx 0.31$ versus $α\approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
Moniri, Behrad, Hassani, Hamed
We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent on the first layer weights in this regime, such one-step update schemes are fundamentally limited: the update to the weights is approximately rank-one, captures only a single direction, and requires the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the \textit{second step} of gradient descent with step-sizes $η_1\asymp N^{α_1}$ and $η_2 \asymp N^{α_2}$ for $α_1, α_2 \in [0,0.5)$, where $N$ is the number of hidden neurons. We derive a spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of the outliers is determined by the parameters $α_1, α_2$ through $\lfloor \frac{α_2}{1/2 - α_1} \rfloor$. Furthermore, by analyzing the alignment between the learned directions and the target function, we identify a gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, provided that $α_1, α_2$ are chosen properly. This shows that the benefits of batch reuse, previously observed in narrow-width regimes, persist in the linear-width limit as well. By characterizing these early-phase evolutions, our work proposes a tractable framework for studying optimization and feature learning phenomenology in modern overparameterized networks.