Goto

Collaborating Authors

 Schwartz, Odelia


Dissecting Query-Key Interaction in Vision Transformers

arXiv.org Artificial Intelligence

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to use the Singular Value Decomposition to dissect the query-key interaction (i.e. ${\textbf{W}_q}^\top\textbf{W}_k$). We find that early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant objects, between parts of an object, or between the foreground and background. This offers a novel perspective on interpreting the attention mechanism, which contributes to understanding how transformer models utilize context and salient features when processing images.


Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

Neural Information Processing Systems

A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor-like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non-linear correlations between the responses of spatially distributed Gabor-like receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Lis (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion.




Characterizing Neural Gain Control using Spike-triggered Covariance

Neural Information Processing Systems

Spike-triggered averaging techniques are effective for linear characterization of neural responses. But neurons exhibit important nonlinear behaviors, such as gain control, that are not captured by such analyses. We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. We demonstrate the method in simulation and on retinal ganglion cell data. Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. This method should be applicable to other sensory areas and modalities.


Natural Sound Statistics and Divisive Normalization in the Auditory System

Neural Information Processing Systems

We explore the statistical properties of natural sound stimuli preprocessed witha bank of linear filters. The responses of such filters exhibit a striking form of statistical dependency, in which the response variance of each filter grows with the response amplitude of filters tuned for nearby frequencies. These dependencies may be substantially reduced usingan operation known as divisive normalization, in which the response of each filter is divided by a weighted sum of the rectified responsesof other filters. The weights may be chosen to maximize the independence of the normalized responses for an ensemble of natural sounds.We demonstrate that the resulting model accounts for nonlinearities inthe response characteristics of the auditory nerve, by comparing model simulations to electrophysiological recordings. In previous work (NIPS, 1998) we demonstrated that an analogous model derived from the statistics of natural images accounts for nonlinear properties of neurons in primary visual cortex.


Natural Sound Statistics and Divisive Normalization in the Auditory System

Neural Information Processing Systems

We explore the statistical properties of natural sound stimuli preprocessed with a bank of linear filters. The responses of such filters exhibit a striking form of statistical dependency, in which the response variance of each filter grows with the response amplitude of filters tuned for nearby frequencies. These dependencies may be substantially reduced using an operation known as divisive normalization, in which the response of each filter is divided by a weighted sum of the rectified responses of other filters. The weights may be chosen to maximize the independence of the normalized responses for an ensemble of natural sounds. We demonstrate that the resulting model accounts for nonlinearities in the response characteristics of the auditory nerve, by comparing model simulations to electrophysiological recordings.