Gerstner, Wulfram
High-performance deep spiking neural networks with 0.3 spikes per neuron
Stanojevic, Ana, Woźniak, Stanisław, Bellec, Guillaume, Cherubini, Giovanni, Pantazi, Angeliki, Gerstner, Wulfram
Communication by rare, binary spikes is a key factor in the energy efficiency of biological brains. However, biologically inspired spiking neural networks (SNNs) are harder to train than artificial neural networks (ANNs). This is puzzling given that theoretical results provide exact mapping algorithms from ANNs to SNNs with time-to-first-spike (TTFS) coding. In this paper we analyze in theory and simulation the learning dynamics of TTFS networks and identify a specific instance of the vanishing-or-exploding gradient problem. While two choices of SNN mappings solve this problem at initialization, only the one with a constant slope of the neuron membrane potential at threshold guarantees equivalence of the training trajectory between SNNs and ANNs with rectified linear units. We demonstrate that deep SNN models can be trained to exactly the same performance as ANNs, surpassing previous SNNs on image classification datasets such as MNIST/Fashion-MNIST, CIFAR10/CIFAR100 and PLACES365. Our SNN accomplishes high-performance classification with less than 0.3 spikes per neuron, lending itself to an energy-efficient implementation. We show that fine-tuning SNNs with our robust gradient descent algorithm enables their optimization for hardware implementations with low latency and resilience to noise and quantization.
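The correspondence between ReLU activations and first-spike times can be illustrated with a toy sketch (my own simplification, not the paper's exact mapping): larger activations translate into earlier spikes within an assumed per-layer coding window, and silent units never spike.

```python
import numpy as np

T = 1.0  # assumed length of the per-layer coding window (hypothetical constant)

def relu(x):
    return np.maximum(x, 0.0)

def activation_to_spike_time(a, a_max):
    """Map non-negative activations to first-spike times in [0, T]; zero activation means no spike."""
    return np.where(a > 0, T * (1.0 - a / a_max), np.inf)

x = np.array([-0.5, 0.2, 0.9, 1.5])
a = relu(x)
print(activation_to_spike_time(a, a_max=a.max()))
# the strongest activation spikes first; the inactive unit never spikes (inf)
```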
Should Under-parameterized Student Networks Copy or Average Teacher Weights?
Şimşek, Berfin, Bendjeddou, Amire, Gerstner, Wulfram, Brea, Johanni
Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n < k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions by solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of under-parameterized networks has a universal structure.
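The copy-average idea can be illustrated with a rough Monte-Carlo comparison (a heuristic construction with assumed weights and dimensions, not the paper's exact critical point): a student that copies $n-1$ teacher neurons and averages the rest versus a student whose neurons all average the full teacher.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, k, n = 6, 4, 3                      # input dim, teacher neurons, student neurons (assumed)
W_teacher = np.eye(d)[:k]              # orthonormal incoming vectors (rows)

def f(X, W, v):
    return erf(X @ W.T) @ v            # one-hidden-layer network with erf activation

X = rng.standard_normal((100_000, d))  # standard Gaussian input distribution
y = f(X, W_teacher, np.ones(k))        # teacher with unit outgoing weights

# Student A: n-1 neurons copy teacher neurons, the n-th averages the remaining k-n+1 ones
# (heuristic instantiation of a "copy-average" configuration).
W_copy_avg = np.vstack([W_teacher[: n - 1], W_teacher[n - 1 :].mean(axis=0)])
v_copy_avg = np.array([1.0] * (n - 1) + [k - n + 1.0])

# Student B: every student neuron averages all teacher neurons.
W_all_avg = np.tile(W_teacher.mean(axis=0), (n, 1))
v_all_avg = np.full(n, k / n)

for name, W, v in [("copy-average", W_copy_avg, v_copy_avg), ("all-average", W_all_avg, v_all_avg)]:
    loss = 0.5 * np.mean((f(X, W, v) - y) ** 2)
    print(f"{name:12s} Monte-Carlo loss: {loss:.4f}")
```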
GateON: an unsupervised method for large scale continual learning
Barry, Martin, Bellec, Guillaume, Gerstner, Wulfram
The objective of continual learning (CL) is to learn tasks sequentially without retraining on earlier tasks. However, when subjected to CL, traditional neural networks exhibit catastrophic forgetting and limited generalization. To overcome these problems, we introduce a novel method called 'Gate and Obstruct Network' (GateON). GateON combines learnable gating of activity with online estimation of parameter relevance to safeguard crucial knowledge from being overwritten. Our method generates partially overlapping pathways between tasks, which permits forward and backward transfer during sequential learning. GateON addresses the issue of network saturation after parameter fixation with a re-activation mechanism for fixed neurons, enabling large-scale continual learning. GateON is implemented on a wide range of networks (fully connected, CNN, Transformers), has low computational complexity, effectively learns up to 100 MNIST learning tasks, and achieves top-tier results for pre-trained BERT in CL-based NLP tasks.
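A schematic PyTorch sketch of the two ingredients named above, per-task learnable activity gates and relevance-based obstruction of updates, might look as follows; the class name, relevance proxy, and threshold are my own assumptions, not the released GateON implementation.

```python
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    def __init__(self, d_in, d_out, n_tasks):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.gates = nn.Parameter(torch.ones(n_tasks, d_out))  # learnable per-task activity gates
        self.relevance = torch.zeros(d_out)                    # online estimate of unit relevance

    def forward(self, x, task_id):
        h = torch.relu(self.linear(x)) * torch.sigmoid(self.gates[task_id])
        # running relevance signal (a hypothetical proxy for the paper's estimator)
        self.relevance = 0.99 * self.relevance + 0.01 * h.detach().abs().mean(dim=0)
        return h

    def protect_gradients(self, threshold=0.5):
        """Obstruct updates of output units whose estimated relevance exceeds a threshold."""
        if self.linear.weight.grad is not None:
            mask = (self.relevance < threshold).float()
            self.linear.weight.grad *= mask.unsqueeze(1)
            self.linear.bias.grad *= mask

layer = GatedLinear(16, 32, n_tasks=5)
h = layer(torch.randn(8, 16), task_id=0)   # forward pass for task 0
```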
MLPGradientFlow: going with the flow of multilayer perceptrons (and finding minima fast and accurately)
Brea, Johanni, Martinelli, Flavio, Şimşek, Berfin, Gerstner, Wulfram
MLPGradientFlow is a software package for numerically solving the gradient flow differential equation $\dot \theta = -\nabla \mathcal L(\theta; \mathcal D)$, where $\theta$ are the parameters of a multi-layer perceptron, $\mathcal D$ is some data set, and $\nabla \mathcal L$ is the gradient of a loss function. We show numerically that adaptive first- or higher-order integration methods based on Runge-Kutta schemes have better accuracy and convergence speed than gradient descent with the Adam optimizer. However, we find Newton's method and approximations like BFGS preferable for finding fixed points (local and global minima of $\mathcal L$) efficiently and accurately. For small networks and data sets, gradients are usually computed faster than in pytorch and Hessians are computed at least $5\times$ faster. Additionally, the package features an integrator for a teacher-student setup with bias-free, two-layer networks trained with standard Gaussian input in the limit of infinite data. The code is accessible at https://github.com/jbrea/MLPGradientFlow.jl.
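For illustration, the same gradient-flow equation can be integrated for a toy two-layer perceptron with an off-the-shelf adaptive Runge-Kutta scheme (a minimal Python sketch using SciPy and PyTorch, not the Julia package itself; architecture and data are arbitrary choices).

```python
import numpy as np
import torch
from scipy.integrate import solve_ivp

torch.manual_seed(0)
X = torch.randn(128, 2)
y = torch.tanh(X @ torch.tensor([[1.0], [-2.0]]))          # toy regression targets

model = torch.nn.Sequential(torch.nn.Linear(2, 4), torch.nn.Tanh(), torch.nn.Linear(4, 1))
theta0 = torch.nn.utils.parameters_to_vector(model.parameters()).detach().numpy()

def neg_grad(t, theta):
    """Right-hand side of the gradient flow ODE: -grad L(theta; D)."""
    torch.nn.utils.vector_to_parameters(torch.tensor(theta, dtype=torch.float32), model.parameters())
    loss = torch.mean((model(X) - y) ** 2)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return -torch.cat([g.reshape(-1) for g in grads]).numpy()

sol = solve_ivp(neg_grad, t_span=(0.0, 20.0), y0=theta0, method="RK45", rtol=1e-6, atol=1e-8)
print("integration steps:", len(sol.t))
```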
Mesoscopic modeling of hidden spiking neurons
Wang, Shuqi, Schmutz, Valentin, Bellec, Guillaume, Gerstner, Wulfram
Can we use spiking neural networks (SNNs) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking optogenetic stimulation.
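The coarse-graining of a homogeneous hidden population can be illustrated generically (a textbook-style sketch under assumed rates and sizes, not the neuLVM equations): instead of simulating every unobserved neuron, one draws the summed population spike count directly from a mesoscopic rate variable.

```python
import numpy as np

rng = np.random.default_rng(1)
N, dt, steps = 1000, 1e-3, 500                 # hidden population size, bin width, duration (assumed)
rate = 5.0 + 3.0 * np.sin(np.linspace(0, 4 * np.pi, steps))   # time-varying population rate (Hz)
p = np.clip(rate * dt, 0.0, 1.0)               # spike probability per neuron and bin

# microscopic: simulate every hidden neuron as an independent Bernoulli spike per bin
micro_counts = rng.binomial(1, p, size=(N, steps)).sum(axis=0)

# mesoscopic: draw the summed population spike count directly from the same rate
meso_counts = rng.binomial(N, p)

print("mean counts per bin:", micro_counts.mean(), meso_counts.mean())
```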
An Exact Mapping From ReLU Networks to Spiking Neural Networks
Stanojevic, Ana, Woźniak, Stanisław, Bellec, Guillaume, Cherubini, Giovanni, Pantazi, Angeliki, Gerstner, Wulfram
Energy consumption of deep artificial neural networks (ANNs) with thousands of neurons poses a problem not only during training [1], but also during inference [2]. Among other alternatives [3, 4, 5], hardware implementations of spiking neural networks (SNNs) [6, 7, 8, 9, 10] have been proposed as an energy-efficient solution, not only for large centralized applications, but also for computing in edge devices [11, 12, 13]. In SNNs, neurons communicate by ultra-short pulses, called action potentials or spikes, that can be considered point-like events in continuous time. In deep multi-layer SNNs, if a neuron in layer n fires a spike, this event causes a change in the voltage trajectory of neurons in layer n + 1. If, after some time, the trajectory of a neuron in layer n + 1 reaches a threshold value, then this neuron fires a spike. While there is no general consensus on how to best decode spike trains in biology [14, 15, 16], multiple pieces of evidence indicate that immediately after the onset of a stimulus, populations of neurons in auditory, visual, or tactile sensory areas respond in such a way that the timing of the first spike of each neuron after stimulus onset contains a high amount of information about the stimulus features [17, 18, 19]. These and similar observations have given rise to the idea that, immediately after stimulus onset, an initial wave of activity is triggered and travels across several brain areas in the sensory processing stream [20, 21, 22, 23, 24]. We take inspiration from these observations and assume in this paper that information is encoded in the exact spike times of each neuron and that spikes are transmitted in a wave-like manner across layers of a deep feedforward neural network. Specifically, we use coding by time-to-first-spike (TTFS) [15], a timing-based code originally proposed in neuroscience [15, 17, 18, 22], which has recently attracted substantial attention in the context of neuromorphic implementations [8, 9, 10, 25, 26, 27, 28, 29, 30].
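The integrate-to-threshold dynamics described above can be sketched for a single layer with toy linear post-synaptic integration (my own illustrative dynamics and parameters, not the exact mapping derived in the paper): each input spike tilts the downstream voltage trajectories upward, and an output neuron emits its first spike when its trajectory crosses the threshold.

```python
import numpy as np

def ttfs_layer(t_in, W, threshold=1.0, t_max=2.0, dt=1e-3):
    """Return first-spike times of output neurons; np.inf if the threshold is never reached."""
    times = np.arange(0.0, t_max, dt)
    # each presynaptic spike contributes a linearly increasing voltage after its spike time
    drive = np.maximum(times[None, :] - t_in[:, None], 0.0)   # (n_in, n_t)
    V = W @ drive                                             # (n_out, n_t) voltage trajectories
    crossed = V >= threshold
    return np.where(crossed.any(axis=1), times[crossed.argmax(axis=1)], np.inf)

t_in = np.array([0.1, 0.4, np.inf])        # the third input neuron stays silent
W = np.array([[1.5, 0.5, 1.0], [0.2, 0.1, 0.3]])
print(ttfs_layer(t_in, W))                 # the strongly driven neuron fires; the weakly driven one never does
```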
A taxonomy of surprise definitions
Modirshanechi, Alireza, Brea, Johanni, Gerstner, Wulfram
Surprising events trigger measurable brain activity and influence human behavior by affecting learning, memory, and decision-making. Currently, however, there is no consensus on the definition of surprise. Here we identify 18 mathematical definitions of surprise in a unifying framework. We first propose a technical classification of these definitions into three groups based on their dependence on an agent's belief, show how they relate to each other, and prove under what conditions they are indistinguishable. Going beyond this technical analysis, we propose a taxonomy of surprise definitions and classify them into four conceptual categories based on the quantity they measure: (i) 'prediction surprise' measures a mismatch between a prediction and an observation; (ii) 'change-point detection surprise' measures the probability of a change in the environment; (iii) 'confidence-corrected surprise' explicitly accounts for the effect of confidence; and (iv) 'information gain surprise' measures the belief update upon a new observation. The taxonomy lays the foundation for principled studies of the functional roles and physiological signatures of surprise in the brain.
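Two of these conceptual categories can be made concrete with standard textbook formulas (an illustrative sketch for a Beta-Bernoulli learner, not tied to the paper's notation): prediction surprise as the negative log predictive probability of the observation, and information-gain surprise as the KL divergence between posterior and prior belief.

```python
import numpy as np
from scipy.special import betaln, digamma

a, b = 4.0, 2.0                        # prior Beta(a, b) belief about P(heads) (assumed example)
x = 0                                  # observation: tails

p_x = (a if x == 1 else b) / (a + b)   # predictive probability of the observation
prediction_surprise = -np.log(p_x)     # mismatch between prediction and observation

a_post, b_post = a + x, b + (1 - x)    # conjugate Bayesian update of the belief

def kl_beta(a1, b1, a2, b2):
    """Closed-form KL(Beta(a1, b1) || Beta(a2, b2)) in nats."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            + (a2 + b2 - a1 - b1) * digamma(a1 + b1))

information_gain = kl_beta(a_post, b_post, a, b)   # belief update: KL(posterior || prior)
print(f"prediction surprise: {prediction_surprise:.3f} nats, information gain: {information_gain:.3f} nats")
```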
Fitting summary statistics of neural data with a differentiable spiking network simulator
Bellec, Guillaume, Wang, Shuqi, Modirshanechi, Alireza, Brea, Johanni, Gerstner, Wulfram
Fitting network models to neural activity is becoming an important tool in neuroscience. A popular approach is to model a brain area with a probabilistic recurrent spiking network whose parameters maximize the likelihood of the recorded activity. Although this is widely used, we show that the resulting model does not produce realistic neural activity and wrongly estimates the connectivity matrix when neurons that are not recorded have a substantial impact on the recorded network. To correct for this, we suggest augmenting the log-likelihood with terms that measure the dissimilarity between simulated and recorded activity. This dissimilarity is defined via summary statistics commonly used in neuroscience, and the optimization is efficient because it relies on back-propagation through the stochastically simulated spike trains. We analyze this method theoretically and show empirically that it generates more realistic activity statistics and recovers the connectivity matrix better than other methods.
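A schematic version of such an augmented objective might look as follows (my own simplification: a Bernoulli spike likelihood plus a PSTH mismatch term weighted by a hypothetical hyper-parameter lam; in the paper the gradient also flows through the stochastically simulated spike trains, which this sketch replaces by a rate proxy).

```python
import torch

def combined_loss(logits, recorded_spikes, simulated_spikes, lam=1.0, bin_size=10):
    """Negative log-likelihood plus a weighted dissimilarity of summary statistics."""
    nll = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, recorded_spikes, reduction="sum")
    def psth(spikes):                      # summary statistic: trial-averaged binned spike counts
        t = spikes.shape[-1] // bin_size * bin_size
        return spikes[..., :t].reshape(*spikes.shape[:-1], -1, bin_size).sum(-1).mean(0)
    dissimilarity = torch.mean((psth(simulated_spikes) - psth(recorded_spikes)) ** 2)
    return nll + lam * dissimilarity

# toy usage with shapes (trials, neurons, time)
trials, neurons, T = 8, 3, 200
logits = torch.zeros(trials, neurons, T, requires_grad=True)
recorded = torch.bernoulli(torch.full((trials, neurons, T), 0.05))
simulated = torch.sigmoid(logits)          # rate proxy standing in for the differentiable simulation
combined_loss(logits, recorded, simulated).backward()
print(logits.grad.shape)
```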
Towards truly local gradients with CLAPP: Contrastive, Local And Predictive Plasticity
Illing, Bernd, Gerstner, Wulfram, Bellec, Guillaume
Back-propagation (BP) is costly to implement in hardware and implausible as a learning rule in the brain. However, BP is surprisingly successful in explaining neuronal activity patterns found along the cortical processing stream. We propose a locally implementable, unsupervised learning algorithm, CLAPP, which minimizes a simple, layer-specific loss function and thus does not need to back-propagate error signals. The weight updates depend only on state variables of the pre- and post-synaptic neurons and a layer-wide third factor. Networks trained with CLAPP build deep hierarchical representations of images and speech.
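A layer-local contrastive predictive loss in this spirit can be sketched as follows (my own simplified variant, not the published CLAPP rule): a layer scores its prediction of a future input against a negative sample with a hinge loss, so no error signal crosses layer boundaries.

```python
import torch
import torch.nn as nn

class LocalContrastiveLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())
        self.predictor = nn.Linear(d_out, d_out, bias=False)   # predicts the next representation

    def local_loss(self, x_t, x_next, x_negative):
        z_t = self.encoder(x_t)
        score_pos = (self.predictor(z_t) * self.encoder(x_next)).sum(-1)
        score_neg = (self.predictor(z_t) * self.encoder(x_negative)).sum(-1)
        # hinge contrastive loss: pull the true future closer, push the negative sample away
        return (torch.relu(1.0 - score_pos) + torch.relu(1.0 + score_neg)).mean()

layer = LocalContrastiveLayer(20, 10)
x_t, x_next, x_neg = torch.randn(3, 32, 20)
layer.local_loss(x_t, x_next, x_neg).backward()   # gradients stay within this layer's parameters
```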
Rescaling, thinning or complementing? On goodness-of-fit procedures for point process models and Generalized Linear Models
Gerhard, Felipe, Gerstner, Wulfram
Generalized Linear Models (GLMs) are an increasingly popular framework for modeling neural spike trains. They have been linked to the theory of stochastic point processes, and researchers have used this relation to assess goodness-of-fit using methods from point-process theory, e.g. the time-rescaling theorem. Here, we show how goodness-of-fit tests from point-process theory can still be applied to GLMs by constructing equivalent surrogate point processes out of time-series observations. Furthermore, two additional tests based on thinning and complementing point processes are introduced. They extend the set of tools available for checking the model adequacy of point processes as well as of discretized models.
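The classical time-rescaling check that this work builds on can be sketched in a few lines (the standard textbook procedure, not the thinning or complementing tests introduced in the paper): rescale spike times by the integrated intensity; under an adequate model the rescaled inter-event intervals are unit-rate exponential, which a KS test can assess.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
lam = 5.0                                                   # true (and here also fitted) constant intensity
spikes = np.cumsum(rng.exponential(1.0 / lam, size=500))    # homogeneous Poisson spike train

Lambda = lam * spikes                                       # integrated intensity at each spike time
rescaled_isis = np.diff(np.concatenate(([0.0], Lambda)))
u = 1.0 - np.exp(-rescaled_isis)                            # should be Uniform(0, 1) if the model fits
print(kstest(u, "uniform"))
```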