xor
Spontaneous Kolmogorov-Arnold Geometry in Shallow MLPs
Freedman, Michael H., Mulligan, Michael
The Kolmogorov-Arnold (KA) representation theorem constructs universal, but highly non-smooth inner functions (the first layer map) in a single (non-linear) hidden layer neural network. Such universal functions have a distinctive local geometry, a "texture," which can be characterized by the inner function's Jacobian $J({\mathbf{x}})$, as $\mathbf{x}$ varies over the data. It is natural to ask if this distinctive KA geometry emerges through conventional neural network optimization. We find that indeed KA geometry often is produced when training vanilla single hidden layer neural networks. We quantify KA geometry through the statistical properties of the exterior powers of $J(\mathbf{x})$: number of zero rows and various observables for the minor statistics of $J(\mathbf{x})$, which measure the scale and axis alignment of $J(\mathbf{x})$. This leads to a rough understanding for where KA geometry occurs in the space of function complexity and model hyperparameters. The motivation is first to understand how neural networks organically learn to prepare input data for later downstream processing and, second, to learn enough about the emergence of KA geometry to accelerate learning through a timely intervention in network hyperparameters. This research is the "flip side" of KA-Networks (KANs). We do not engineer KA into the neural network, but rather watch KA emerge in shallow MLPs.
Learned feature representations are biased by complexity, learning order, position, and more
Lampinen, Andrew Kyle, Chan, Stephanie C. Y., Hermann, Katherine
Representation learning, and interpreting learned representations, are key areas of focus in machine learning and neuroscience. Both fields generally use representations as a means to understand or improve a system's computations. In this work, however, we explore surprising dissociations between representation and computation that may pose challenges for such efforts. We create datasets in which we attempt to match the computational role that different features play, while manipulating other properties of the features or the data. We train various deep learning architectures to compute these multiple abstract features about their inputs. We find that their learned feature representations are systematically biased towards representing some features more strongly than others, depending upon extraneous properties such as feature complexity, the order in which features are learned, and the distribution of features over the inputs. For example, features that are simpler to compute or learned first tend to be represented more strongly and densely than features that are more complex or learned later, even if all features are learned equally well. We also explore how these biases are affected by architectures, optimizers, and training regimes (e.g., in transformers, features decoded earlier in the output sequence also tend to be represented more strongly). Our results help to characterize the inductive biases of gradient-based representation learning. These results also highlight a key challenge for interpretability $-$ or for comparing the representations of models and brains $-$ disentangling extraneous biases from the computationally important aspects of a system's internal representations.
$\partial\mathbb{B}$ nets: learning discrete functions by gradient descent
B nets are differentiable neural networks that learn discrete boolean-valued functions by gradient descent. B nets have two semantically equivalent aspects: a differentiable soft-net, with real weights, and a non-differentiable hard-net, with boolean weights. We train the soft-net by backpropagation and then'harden' the learned weights to yield boolean weights that bind with the hard-net. The result is a learned discrete function. 'Hardening' involves no loss of accuracy, unlike existing approaches to neural network binarization. Preliminary experiments demonstrate that B nets achieve comparable performance on standard machine learning problems yet are compact (due to 1-bit weights) and interpretable (due to the logical nature of the learnt functions). Neural networks are differentiable functions with weights represented by machine floats. Networks are trained by gradient descent in weight-space, where the direction of descent minimises loss. The gradients are efficiently calculated by the backpropagation algorithm (Rumelhart et al., 1986). This overall approach has led to tremendous advances in machine learning.
Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence
Glasgow, Margalit, Wei, Colin, Wootters, Mary, Ma, Tengyu
A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan and Kolter (2019) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear, and one non-linear. We study the linear classification setting of Nagarajan and Kolter, and a quadratic ground truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings. Our results show that near-max-margin is important: while any model that achieves at least a $(1 - \epsilon)$-fraction of the max-margin generalizes well, a classifier achieving half of the max-margin may fail terribly. Building on the impossibility results of Nagarajan and Kolter, under slightly stronger assumptions, we show that one-sided UC bounds and classical margin bounds will fail on near-max-margin classifiers. Our analysis provides insight on why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are present.
How neurons really work is being elucidated
A neuron is a thing of beauty. Ever since Santiago Ramรณn y Cajal stained them with silver nitrate to make them visible under the microscopes of the 1880s (see drawing above), their ramifications have fired the scientific imagination. Ramรณn y Cajal called them the butterflies of the soul. Your browser does not support the audio element. Those ramifications--dendrites by the dozen to collect incoming signals, called action potentials, from other neurons, and a single axon to pass on the summed wisdom of those signals in the form of another action potential, turn neurons into parts of far bigger structures known as neural networks.
A general approach to progressive learning
Vogelstein, Joshua T., Helm, Hayden S., Mehta, Ronak D., Dey, Jayanta, LeVine, Will, Yang, Weiwei, Tower, Bryan, Larson, Jonathan, White, Chris, Priebe, Carey E.
In biological learning, data are used to improve performance simultaneously on the current task, as well as previously encountered and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of progressive learning, whether biological or artificial, is to improve performance on all tasks (including past and future) with any new data. We propose representation ensembling, as opposed to learner ensembling (e.g., bagging), to address progressive learning. We show that representation ensembling -- including representations learned by decision forests or deep network -- uniquely demonstrates improved performance on both past and future tasks in a variety of simulated and real data scenarios, including vision, language, and adversarial tasks, with or without resource constraints. Beyond progressive learning, this work has immediate implications with regards to mitigating batch effects and federated learning applications. We expect a deeper understanding of the mechanisms underlying biological progressive learning to enable further improvements in machine progressive learning.
Copula Representations and Error Surface Projections for the Exclusive Or Problem
The exclusive or (xor) function is one of the simplest examples that illustrate why nonlinear feedforward networks are superior to linear regression for machine learning applications. We review the xor representation and approximation problems and discuss their solutions in terms of probabilistic logic and associative copula functions. After briefly reviewing the specification of feedforward networks, we compare the dynamics of learned error surfaces with different activation functions such as RELU and tanh through a set of colorful three-dimensional charts. The copula representations extend xor from Boolean to real values, thereby providing a convenient way to demonstrate the concept of cross-validation on in-sample and out-sample data sets. Our approach is pedagogical and is meant to be a machine learning prolegomenon. Keywords: machine learning; neural networks; probabilistic logic; copulas; error surfaces; xor.
Consturuct a neural network (multilayer perceptrons) using micro:bit
Let's learn the basics of neural networks using micro:bit. For neural network learning to solve various tasks, back propagation is generally required. Learning with this back propagation requires considerably long time calculations. It is not realistic to do this with such a tiny microbit with low computing power, even though it is not impossible. Therefore, here we will try forward calculation only, using the edge weights of the already learned neural network.