 linear layer


Composing Linear Layers from Irreducibles

Neural Information Processing Systems

Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors (geometric objects encoding oriented planes) and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only $\mathcal{O}(\log^2 d)$ parameters, versus $\mathcal{O}(d^2)$ required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.
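To make the idea of composing a map from oriented-plane primitives concrete, the sketch below builds an orthogonal matrix as a product of plane rotations, each specified by a plane (i, j) and an angle. This is a plain matrix illustration using Givens-style rotations, not the Clifford-algebra algorithm from the abstract, and it does not reproduce the $\mathcal{O}(\log^2 d)$ parameter count. The names `plane_rotation`, `compose_rotations`, and `generators` are hypothetical.

```python
import numpy as np

def plane_rotation(d, i, j, theta):
    """Rotation by angle theta in the plane spanned by axes i and j of R^d."""
    R = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = c
    R[j, j] = c
    R[i, j] = -s
    R[j, i] = s
    return R

def compose_rotations(d, generators):
    """Left-multiply the plane rotations in the order they are listed."""
    W = np.eye(d)
    for i, j, theta in generators:
        W = plane_rotation(d, i, j, theta) @ W
    return W

# Assumed toy setup: three (plane, angle) primitives in d = 4.
generators = [(0, 1, 0.3), (1, 2, -1.1), (0, 3, 0.7)]
W = compose_rotations(4, generators)
```

Each primitive here carries a single angle, so the composed map is described by as many numbers as there are planes rather than by all d x d matrix entries.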


GRASS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection

Neural Information Processing Systems

Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GRASS, a novel gradient compression algorithm, and its variant FACTGRASS, designed specifically for linear layers. Both explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FACTGRASS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines.
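The two-stage idea named in the title (sparsify the gradient, then apply a sparse projection) can be sketched roughly as follows. This is only an illustration under assumptions: the random coordinate mask and the sign-hashing projection are generic choices made here, not details confirmed by the abstract, and `make_compressor`, `k_keep`, and `k_out` are hypothetical names.

```python
import numpy as np

def make_compressor(p, k_keep, k_out, seed=0):
    """Return a function mapping a length-p gradient to a length-k_out sketch."""
    rng = np.random.default_rng(seed)
    # Stage 1 (assumed): keep a fixed random subset of k_keep coordinates.
    keep = rng.choice(p, size=k_keep, replace=False)
    # Stage 2 (assumed): each kept coordinate is added, with a random sign,
    # into exactly one of k_out output buckets.
    bucket = rng.integers(0, k_out, size=k_keep)
    sign = rng.choice(np.array([-1.0, 1.0]), size=k_keep)

    def compress(grad):
        out = np.zeros(k_out)
        np.add.at(out, bucket, sign * grad[keep])  # unbuffered scatter-add
        return out

    return compress

compress = make_compressor(p=10_000, k_keep=512, k_out=64)
rng = np.random.default_rng(1)
g1, g2 = rng.normal(size=10_000), rng.normal(size=10_000)
z1, z2 = compress(g1), compress(g2)
```

In this sketch the per-gradient work scales with `k_keep` rather than with the full dimension `p`, and the same compressor is reused across samples so that the compressed gradients remain comparable.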


Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers

Neural Information Processing Systems

The empirical emergence of neural collapse, a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks, has spurred a line of theoretical research aimed at understanding it. However, existing work focuses on data-agnostic models or, when data structure is taken into account, remains limited to multi-layer perceptrons. Our paper fills both gaps by analyzing modern architectures in a data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross-entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use in the literature even beyond data-agnostic settings. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.
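One simple way to quantify how collapsed a set of penultimate-layer features is compares within-class scatter to between-class scatter; the ratio approaches zero as every feature converges to its class mean. The sketch below is a simplified proxy chosen for illustration, not necessarily the metric used in the paper, and `within_between_ratio` is a hypothetical name.

```python
import numpy as np

def within_between_ratio(features, labels):
    """Total within-class scatter divided by total between-class scatter."""
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += np.sum((class_feats - class_mean) ** 2)
        between += len(class_feats) * np.sum((class_mean - global_mean) ** 2)
    return within / between

# Assumed toy data: 3 classes, 50 samples each, 16-dimensional features.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
means = rng.normal(size=(3, 16))
collapsed = means[labels]                                   # every sample equals its class mean
noisy = collapsed + 0.5 * rng.normal(size=collapsed.shape)  # samples scattered around the means
```

Calling `within_between_ratio(collapsed, labels)` gives zero because there is no within-class scatter, while the noisy features give a strictly positive value.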


CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices

Neural Information Processing Systems

Normalizing flows are deep generative models that achieve efficient likelihood estimation and sampling through invertible transformations. A key challenge is designing linear layers that enhance expressiveness while enabling efficient computation of the Jacobian determinant and inverse. In this work, we introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition provides a parameter- and computation-efficient formulation, reducing the parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ by using $m$ diagonal matrices together with $m-1$ circulant matrices, while approximating arbitrary linear transformations. Furthermore, leveraging the Fast Fourier Transform (FFT), our method reduces the time complexity of matrix inversion from $\mathcal{O}(n^{3})$ to $\mathcal{O}(mn \log n)$ and matrix log-determinant from $\mathcal{O}(n^{3})$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. Building upon this, we introduce a novel normalizing flow model called Circulant-Diagonal Flow (CDFlow). Empirical results demonstrate that CDFlow excels in density estimation for natural image datasets and effectively models data with inherent periodicity. In terms of computational efficiency, our method speeds up the matrix inverse and log-determinant computations by $1.17\times$ and $4.31\times$, respectively, compared to the general dense matrix, when the number of channels is set to 96.
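The FFT shortcut rests on a standard fact: a circulant matrix is diagonalized by the discrete Fourier transform, so its eigenvalues are the FFT of its first column. The sketch below applies, inverts, and computes the log-determinant of an alternating product of $m$ diagonal and $m-1$ circulant factors without forming a dense matrix. The interleaving order and all function names are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def circulant_apply(c, x):
    """Multiply x by the circulant matrix whose first column is c."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_solve(c, y):
    """Solve C x = y by dividing by the eigenvalues fft(c)."""
    return np.real(np.fft.ifft(np.fft.fft(y) / np.fft.fft(c)))

def layer_forward(diags, circs, x):
    # Assumed order: D_1, C_1, D_2, C_2, ..., D_m (D_1 is applied first).
    for i, d in enumerate(diags):
        x = d * x
        if i < len(circs):
            x = circulant_apply(circs[i], x)
    return x

def layer_inverse(diags, circs, y):
    # Undo the factors in reverse order.
    for i in reversed(range(len(diags))):
        if i < len(circs):
            y = circulant_solve(circs[i], y)
        y = y / diags[i]
    return y

def layer_logabsdet(diags, circs):
    # log|det| of a product is the sum of the factors' log|det| values.
    total = sum(np.sum(np.log(np.abs(d))) for d in diags)
    total += sum(np.sum(np.log(np.abs(np.fft.fft(c)))) for c in circs)
    return total

rng = np.random.default_rng(0)
n, m = 8, 3
diags = [rng.uniform(0.5, 1.5, size=n) for _ in range(m)]
circs = []
for _ in range(m - 1):
    c = 0.2 * rng.normal(size=n)
    c[0] += 3.0  # keeps the eigenvalues away from zero in this toy example
    circs.append(c)

x = rng.normal(size=n)
y = layer_forward(diags, circs, x)
x_rec = layer_inverse(diags, circs, y)
```

As written, `layer_logabsdet` recomputes an FFT per circulant factor; storing each circulant factor directly by its eigenvalues would avoid that step, though whether the paper does so is not stated in the abstract.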


TensorNet: Cartesian Tensor Representations for Efficient Learning of Molecular Potentials

Neural Information Processing Systems

The development of efficient machine learning models for representing molecular systems is becoming crucial in scientific research. We introduce TensorNet, an innovative O(3)-equivariant message-passing neural network architecture that leverages Cartesian tensor representations. By using Cartesian tensor atomic embeddings, feature mixing is simplified through matrix product operations. Furthermore, the cost-effective decomposition of these tensors into rotation group irreducible representations allows for the separate processing of scalars, vectors, and tensors when necessary. Compared to higher-rank spherical tensor models, TensorNet demonstrates state-of-the-art performance with significantly fewer parameters. For small molecule potential energies, this can be achieved even with a single interaction layer. As a result of all these properties, the model's computational cost is substantially decreased. Moreover, the accurate prediction of vector and tensor molecular quantities, in addition to potential energies and forces, is possible. In summary, TensorNet's framework opens up a new space for the design of state-of-the-art equivariant models.
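The decomposition into irreducible representations mentioned above can be illustrated with the standard splitting of a 3x3 Cartesian tensor into an isotropic part (1 component, scalar), an antisymmetric part (3 components, vector-like), and a symmetric traceless part (5 components). The sketch below is a generic NumPy illustration of that identity, not TensorNet's code; `decompose` is a hypothetical name.

```python
import numpy as np

def decompose(X):
    """Split a 3x3 tensor into isotropic, antisymmetric, and symmetric traceless parts."""
    scalar = np.trace(X) / 3.0 * np.eye(3)
    antisym = 0.5 * (X - X.T)
    sym_traceless = 0.5 * (X + X.T) - scalar
    return scalar, antisym, sym_traceless

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3))
scalar, antisym, sym_traceless = decompose(X)

# An orthogonal matrix (an element of O(3)) obtained from a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
rotated_parts = decompose(Q @ X @ Q.T)
```

The three parts sum back to `X`, and decomposing the transformed tensor `Q @ X @ Q.T` gives the same result as transforming each part separately, which is why the parts can be processed independently without breaking equivariance.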