low-rank structure
Group-Aware Matrix Estimation and Latent Subspace Recovery
Golubovic, Hamza, Shen, Matthew, Allen, Genevera I., Zikry, Tarek M.
Modern matrix completion problems often involve heterogeneous data whose rows simultaneously belong to many meta-categories, such as demographic and age groups in recommendation systems, or region and recording session labels in neural electrophysiological experiments. Standard low-rank estimators impose a single global latent geometry, which can recover average structure but may smooth away subgroup-specific variation, especially when observations are unevenly distributed across groups. We introduce Group-Aware Matrix Estimation (GAME), a convex estimator for overlapping subgroup-wise low-rank matrix estimation. GAME regularizes category-specific submatrices through overlapping nuclear-norm penalties, allowing related groups to borrow information while preserving local latent structure in a shared coordinate system. We provide finite-sample guarantees for both reconstruction error and subgroup-specific subspace recovery, showing how performance depends on sampling density, subgroup rank, and overlap structure. Experiments on synthetic, recommendation, ecological, and neuroscience datasets show that GAME is most beneficial in structured missingness regimes, where subgroup-aware regularization improves both reconstruction accuracy and latent subspace fidelity. Across these benchmarks, GAME is competitive or best among global low-rank, side-information, and modern imputation baselines, with the largest gains when subgroups exhibit distinct low-rank structure.
SLTrain: a sparse plus low rank approach for parameter and memory efficient pretraining
Large language models (LLMs) have shown impressive capabilities across various tasks. However, training LLMs from scratch requires significant computational power and extensive memory capacity. Recent studies have explored low-rank structures on weights for efficient fine-tuning in terms of parameters and memory, either through low-rank adaptation or factorization. While effective for fine-tuning, low-rank structures are generally less suitable for pretraining because they restrict parameters to a low-dimensional subspace. In this work, we propose to parameterize the weights as a sum of low-rank and sparse matrices for pretraining, which we call SLTrain. The low-rank component is learned via matrix factorization, while for the sparse component, we employ a simple strategy of uniformly selecting the sparsity support at random and learning only the non-zero entries with the fixed support. While being simple, the random fixed-support sparse learning strategy significantly enhances pretraining when combined with low-rank learning. Our results show that SLTrain adds minimal extra parameters and memory costs compared to pretraining with low-rank parameterization, yet achieves substantially better performance, which is comparable to full-rank training. Remarkably, when combined with quantization and per-layer updates, SLTrain can reduce memory requirements by up to 73% when pretraining the LLaMA 7B model.
Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning
We study matrix estimation problems arising in reinforcement learning with low-rank structure. In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it characterizes the transition kernel of the MDP. In both cases, each entry of the matrix carries important information, and we seek estimation methods with low entry-wise prediction error. Importantly, these methods further need to accommodate for inherent correlations in the available data (e.g. for MDPs, the data consists of system trajectories). We investigate the performance of simple spectral-based matrix estimation approaches: we show that they efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise prediction error. These new results on low-rank matrix estimation make it possible to devise reinforcement learning algorithms that fully exploit the underlying low-rank structure. We provide two examples of such algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for low-rank MDPs. Both algorithms yield state-of-the-art performance guarantees.
Generalization Bounds for Rank-sparse Neural Networks
Ledent, Antoine, Alves, Rodrigo, Lei, Yunwen
It has been recently observed in much of the literature that neural networks exhibit a bottleneck rank property: for larger depths, the activation and weights of neural networks trained with gradient-based methods tend to be of approximately low rank. In fact, the rank of the activations of each layer converges to a fixed value referred to as the ``bottleneck rank'', which is the minimum rank required to represent the training data. This perspective is in line with the observation that regularizing linear networks (without activations) with weight decay is equivalent to minimizing the Schatten $p$ quasi norm of the neural network. In this paper we investigate the implications of this phenomenon for generalization. More specifically, we prove generalization bounds for neural networks which exploit the approximate low rank structure of the weight matrices if present. The final results rely on the Schatten $p$ quasi norms of the weight matrices: for small $p$, the bounds exhibit a sample complexity $ \widetilde{O}(WrL^2)$ where $W$ and $L$ are the width and depth of the neural network respectively and where $r$ is the rank of the weight matrices. As $p$ increases, the bound behaves more like a norm-based bound instead.
Sequences of Logits Reveal the Low Rank Structure of Language Models
Golowich, Noah, Liu, Allen, Shetty, Abhishek
A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
On the Neural Feature Ansatz for Deep Neural Networks
Tansley, Edward, Massart, Estelle, Cartis, Coralia
Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power $α>0$ of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent $α= 1/2$ (Radhakrishnan et al., 2024). We extend this result to networks with $L \geq 2$ layers, showing that the NFA holds with exponent $α= 1/L$, thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit arbitrarily well the training data. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates and initialization schemes.