
Collaborating Authors

agop


Breaking Data Symmetry is Needed For Generalization in Feature Learning Kernels

Bernal, Marcel Tomàs, Mallinar, Neil Rohit, Belkin, Mikhail

arXiv.org Machine Learning

Grokking occurs when a model achieves high training accuracy long before it generalizes to unseen test points. This phenomenon was initially observed on a class of algebraic problems, such as learning modular arithmetic (Power et al., 2022). We study grokking on algebraic tasks in a class of feature learning kernels via the Recursive Feature Machine (RFM) algorithm (Radhakrishnan et al., 2024), which iteratively updates feature matrices through the Average Gradient Outer Product (AGOP) of an estimator in order to learn task-relevant features. Our main experimental finding is that generalization occurs only when a certain symmetry in the training set is broken. Furthermore, we show empirically that RFM generalizes by recovering the underlying invariance group action inherent in the data. We find that the learned feature matrices encode specific elements of the invariance group, explaining the dependence of generalization on symmetry.
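The RFM loop described in the abstract can be sketched in a few lines. The kernel choice (a Mahalanobis Laplace kernel), bandwidth, ridge regularization, and the finite-difference gradient below are illustrative assumptions for this sketch, not the reference implementation:

```python
import numpy as np

def laplace_kernel(X, Z, M, bandwidth=1.0):
    """Mahalanobis Laplace kernel: k(x, z) = exp(-||x - z||_M / bandwidth)."""
    diffs = X[:, None, :] - Z[None, :, :]
    sq = np.einsum('nmd,de,nme->nm', diffs, M, diffs)
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / bandwidth)

def predict(Xq, Xtr, alpha, M):
    """Kernel predictor f(x) = sum_i alpha_i k(x, x_i)."""
    return laplace_kernel(Xq, Xtr, M) @ alpha

def agop(Xq, Xtr, alpha, M, eps=1e-4):
    """Average Gradient Outer Product of the predictor, with gradients
    taken by central finite differences (an illustrative shortcut)."""
    n, d = Xq.shape
    G = np.zeros((n, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        G[:, j] = (predict(Xq + e, Xtr, alpha, M)
                   - predict(Xq - e, Xtr, alpha, M)) / (2 * eps)
    return G.T @ G / n

def rfm(X, y, iters=3, reg=1e-3):
    """Minimal RFM sketch: alternate kernel ridge regression with an
    AGOP update of the feature matrix M."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(iters):
        K = laplace_kernel(X, X, M)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)
        M = agop(X, X, alpha, M)
        M /= np.trace(M) + 1e-12   # normalize scale for stability
    return M, alpha
```

On a toy regression target that depends only on the first coordinate, the learned feature matrix concentrates its mass on that coordinate, which is the sense in which AGOP recovers task-relevant directions.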



Average gradient outer product as a mechanism for deep neural collapse

Neural Information Processing Systems

Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. Deep Recursive Feature Machines construct a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate theoretically and empirically that DNC occurs in Deep Recursive Feature Machines as a consequence of the projection with the AGOP matrix computed at each layer. We then provide evidence that this mechanism holds for neural networks more generally. We show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.
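The layer construction described in the abstract, mapping the data with the AGOP and then applying an untrained random feature map, admits a short sketch. The symmetric square root and the random ReLU feature map below are assumptions chosen for concreteness, not necessarily the paper's exact choices:

```python
import numpy as np

def deep_rfm_layer(X, grads, width=128, rng=None):
    """One layer in the style of Deep Recursive Feature Machines:
    project the data with the matrix square root of the AGOP, then
    apply an untrained random feature map. `grads` holds per-sample
    input-output gradients of the current predictor."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    agop = grads.T @ grads / n                        # uncentered covariance of gradients
    w, V = np.linalg.eigh(agop)                       # symmetric square root
    sqrt_agop = (V * np.sqrt(np.maximum(w, 0.0))) @ V.T
    Z = X @ sqrt_agop                                 # AGOP projection
    W = rng.standard_normal((d, width)) / np.sqrt(d)  # untrained random weights
    return np.maximum(Z @ W, 0.0)                     # random ReLU features
```

Stacking such layers, each AGOP projection compresses within-class variability in the directions the predictor ignores, which is the mechanism the paper connects to neural collapse.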



xRFM: Accurate, scalable, and interpretable feature learning models for tabular data

Beaglehole, Daniel, Holzmüller, David, Radhakrishnan, Adityanarayanan, Belkin, Mikhail

arXiv.org Machine Learning

Tabular data - collections of continuous and categorical variables organized into matrices - underlies all aspects of modern commerce and science from airplane engines to biology labs to bagel shops. Yet, while Machine Learning and AI for language and vision have seen unprecedented progress, the primary methodologies of prediction from tabular data have been relatively static, dominated by variations of Gradient Boosted Decision Trees (GBDTs), such as XGBoost [7]. Nevertheless, hundreds of tabular datasets have been assembled to form extensive regression and classification benchmarks [11, 12, 16, 35, 37], and, recently, there has been renewed interest in building state-of-the-art predictive models for tabular data [15, 18, 19]. Notably, given the remarkable effectiveness of large, "foundation" models for text, there has been much excitement in developing similar models on tabular data, and recent effort has led to the development of TabPFN-v2, a foundation model for tabular data appearing in Nature [18]. Yet, despite this progress, tabular data remains an active area for model development, and building scalable, effective, and interpretable machine learning models in this domain is still an open challenge. In this work, we introduce xRFM, a tabular predictive model that combines recent advances in feature learning kernel machines with an adaptive tree structure, making it effective, scalable, and interpretable.



FACT: the Features At Convergence Theorem for neural networks

Boix-Adsera, Enric, Mallinar, Neil, Simon, James B., Belkin, Mikhail

arXiv.org Machine Learning

A central challenge in deep learning theory is to understand how neural networks learn and represent features. To this end, we prove the Features at Convergence Theorem (FACT), which gives a self-consistency equation that neural network weights satisfy at convergence when trained with nonzero weight decay. For each weight matrix $W$, this equation relates the "feature matrix" $W^\top W$ to the set of input vectors passed into the matrix during forward propagation and the loss gradients passed through it during backpropagation. We validate this relation empirically, showing that neural features indeed satisfy the FACT at convergence. Furthermore, by modifying the "Recursive Feature Machines" of Radhakrishnan et al. (2024) so that they obey the FACT, we arrive at a new learning algorithm, FACT-RFM. FACT-RFM achieves high performance on tabular data and captures various feature learning behaviors that occur in neural network training, including grokking in modular arithmetic and phase transitions in learning sparse parities.



Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers

Beaglehole, Daniel, Radhakrishnan, Adityanarayanan, Boix-Adserà, Enric, Belkin, Mikhail

arXiv.org Machine Learning

A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know what they know" and may even be actively misleading. In this work, we give a general method for detecting semantic concepts in the internal activations of LLMs. Furthermore, we show that our methodology can be easily adapted to steer LLMs toward desirable outputs. Our innovations are the following: (1) we use a nonlinear feature learning method to identify important linear directions for predicting concepts from each layer; (2) we aggregate features across layers to build powerful concept detectors and steering mechanisms. We showcase the power of our approach by attaining state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks. We highlight the generality of our approach by steering LLMs towards new concepts that, to the best of our knowledge, have not been previously considered in the literature, including: semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, and even multiple concepts simultaneously. Moreover, our method can steer concepts with numerical attributes such as product reviews. We provide our code (including a simple API for our methods) at https://github.com/dmbeaglehole/neural_controllers.
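The two-step recipe in the abstract, per-layer concept directions followed by cross-layer aggregation, can be illustrated schematically. The paper finds directions with a nonlinear feature-learning method; plain ridge regression is used below only as a stand-in, and all function names are illustrative:

```python
import numpy as np

def layer_directions(acts_by_layer, labels, reg=1e-2):
    """Fit one linear concept direction per layer of activations.
    Ridge regression here is a simplifying stand-in for the paper's
    nonlinear feature-learning method."""
    dirs = []
    for A in acts_by_layer:              # A: (n_samples, d) activations at a layer
        d = A.shape[1]
        w = np.linalg.solve(A.T @ A + reg * np.eye(d), A.T @ labels)
        dirs.append(w)
    return dirs

def aggregate_concept_scores(acts_by_layer, dirs):
    """Combine per-layer linear scores into one concept score per sample."""
    scores = np.stack([A @ w for A, w in zip(acts_by_layer, dirs)])
    return scores.mean(axis=0)
```

For steering, the same per-layer directions can be added (scaled) to the residual-stream activations at generation time; the detection step above is where the cross-layer aggregation pays off.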


Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

Mallinar, Neil, Beaglehole, Daniel, Zhu, Libin, Radhakrishnan, Adityanarayanan, Pandit, Parthe, Belkin, Mikhail

arXiv.org Machine Learning

Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near-zero test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures or gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.
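The role of block-circulant features rests on a classical fact behind the Fourier Multiplication Algorithm: every circulant matrix is diagonalized by the discrete Fourier transform, so a circulant feature acting on one-hot encodings of residues implements circular convolution, i.e. addition mod p, coordinate-wise in Fourier space. A minimal numerical check of the diagonalization:

```python
import numpy as np

def circulant(c):
    """Circulant matrix with entries C[i, j] = c[(i - j) % p]."""
    p = len(c)
    return np.array([[c[(i - j) % p] for j in range(p)] for i in range(p)])

p = 7
C = circulant(np.random.default_rng(0).standard_normal(p))

# The DFT matrix diagonalizes every circulant matrix: F C F^{-1} is diagonal,
# with the eigenvalues given by the DFT of the generating vector.
F = np.fft.fft(np.eye(p))                       # p x p DFT matrix
D = F @ C @ np.linalg.inv(F)
off_diag = np.abs(D - np.diag(np.diag(D))).max()
print(off_diag)                                 # numerically zero
```

Because the features are only circulant up to permutation and blocking in practice, the paper's analysis works with block-circulant structure, but the Fourier-diagonalization mechanism sketched here is the same.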