
 determinant


Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows

Neural Information Processing Systems

Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of a design space that could offer new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose a novel framework, TarFlowLM, that employs transformer-based autoregressive normalizing flows [73] to model these continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context through stacked, alternating-direction autoregressive transformations, support block-wise generation with flexible token patch sizes, and facilitate a hierarchical multi-pass generation process. We further propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data, and demonstrate theoretical connections to conventional discrete autoregressive models. Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.
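As a rough illustration of "stacked, alternating-direction autoregressive transformations", the sketch below stacks two affine autoregressive layers over a short sequence of scalar latents, with the second layer run right-to-left. It is not the TarFlowLM architecture: the masked linear conditioners (`L_mu`, `L_s`), the affine transform (used here in place of the mixture-based couplings), and all function names are assumptions made for illustration.

```python
import numpy as np

def ar_forward(L_mu, L_s, x):
    """One affine autoregressive layer: z_t depends on x_t and x_{<t} only.

    L_mu and L_s are strictly lower triangular, so the Jacobian dz/dx is
    lower triangular with diagonal exp(-s_t), and log|det| = -sum(s).
    """
    mu = L_mu @ x
    s = np.tanh(L_s @ x)
    return (x - mu) * np.exp(-s), -np.sum(s)

def ar_inverse(L_mu, L_s, z):
    """Invert the layer sequentially, one position at a time."""
    x = np.zeros_like(z)
    for t in range(z.size):
        # Row t of a strictly lower-triangular matrix only reads x[:t].
        x[t] = z[t] * np.exp(np.tanh(L_s[t] @ x)) + L_mu[t] @ x
    return x

def stack_forward(layers, x):
    """Apply layers in order; odd-indexed layers run right-to-left."""
    total = 0.0
    for k, (L_mu, L_s) in enumerate(layers):
        if k % 2 == 1:
            x = x[::-1]
        x, logdet = ar_forward(L_mu, L_s, x)
        if k % 2 == 1:
            x = x[::-1]
        total += logdet  # flipping is a permutation, so |det| is unchanged
    return x, total

def stack_inverse(layers, z):
    """Undo the stack by inverting the layers in reverse order."""
    for k in reversed(range(len(layers))):
        L_mu, L_s = layers[k]
        if k % 2 == 1:
            z = z[::-1]
        z = ar_inverse(L_mu, L_s, z)
        if k % 2 == 1:
            z = z[::-1]
    return z

rng = np.random.default_rng(0)
T = 6  # toy sequence length
layers = [
    (np.tril(0.3 * rng.standard_normal((T, T)), k=-1),
     np.tril(0.3 * rng.standard_normal((T, T)), k=-1))
    for _ in range(2)
]
x = rng.standard_normal(T)
z, logdet = stack_forward(layers, x)
x_rec = stack_inverse(layers, z)
```

After both layers, every output position can depend on every input position, while each layer's log-determinant remains a simple sum.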


CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices

Neural Information Processing Systems

Normalizing flows are deep generative models that achieve efficient likelihood estimation and sampling through invertible transformations. A key challenge is designing linear layers that enhance expressiveness while enabling efficient computation of the Jacobian determinant and inverse. In this work, we introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition provides a parameter- and computation-efficient formulation, reducing the parameter complexity from O(n^2) to O(mn) by using m diagonal matrices together with m - 1 circulant matrices, while approximating arbitrary linear transformations. Furthermore, leveraging the Fast Fourier Transform (FFT), our method reduces the time complexity of matrix inversion from O(n^3) to O(mn log n) and matrix log-determinant from O(n^3) to O(mn), where n is the input dimension. Building upon this, we introduce a novel normalizing flow model called Circulant-Diagonal Flow (CDFlow). Empirical results demonstrate that CDFlow excels in density estimation for natural image datasets and effectively models data with inherent periodicity. In terms of computational efficiency, our method speeds up the matrix inverse and log-determinant computations by factors of 1.17 and 4.31, respectively, compared to the general dense matrix, when the number of channels is set to 96.
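The idea of an invertible layer built from alternating diagonal and circulant factors can be sketched in a few lines of NumPy. This is a minimal illustration, not CDFlow's implementation: the factor ordering D_m C_{m-1} ... C_1 D_1, the initialisation, and the function names are assumptions. It also obtains circulant eigenvalues by calling the FFT on each first column, so it does not attempt to reproduce the exact costs quoted in the abstract.

```python
import numpy as np

def circulant_apply(c, x):
    """Multiply x by the circulant matrix whose first column is c (via FFT)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_solve(c, y):
    """Solve C x = y for the same circulant matrix by dividing in Fourier space."""
    return np.real(np.fft.ifft(np.fft.fft(y) / np.fft.fft(c)))

def cd_forward(diags, circs, x):
    """Apply D_m C_{m-1} ... C_1 D_1 to x; return (y, log|det|)."""
    y = diags[0] * x
    logdet = np.sum(np.log(np.abs(diags[0])))
    for c, d in zip(circs, diags[1:]):
        y = d * circulant_apply(c, y)
        # The eigenvalues of a circulant matrix are the FFT of its first column.
        logdet += np.sum(np.log(np.abs(np.fft.fft(c))))
        logdet += np.sum(np.log(np.abs(d)))
    return y, logdet

def cd_inverse(diags, circs, y):
    """Undo cd_forward by inverting the factors in reverse order."""
    x = y
    for c, d in zip(reversed(circs), reversed(diags[1:])):
        x = circulant_solve(c, x / d)
    return x / diags[0]

def dense_matrix(diags, circs):
    """Materialise the full n x n matrix (for checking only; O(n^2) memory)."""
    n = diags[0].size
    return np.stack([cd_forward(diags, circs, e)[0] for e in np.eye(n)], axis=1)

rng = np.random.default_rng(0)
n, m = 8, 3  # toy sizes: m diagonal factors, m - 1 circulant factors
diags = [rng.uniform(0.5, 1.5, n) for _ in range(m)]
# Near-identity circulants keep every eigenvalue away from zero.
circs = [np.eye(n)[0] + 0.05 * rng.standard_normal(n) for _ in range(m - 1)]

x = rng.standard_normal(n)
y, logdet = cd_forward(diags, circs, x)
x_rec = cd_inverse(diags, circs, y)
```

The layer stores m·n diagonal entries plus (m - 1)·n circulant entries, which is the O(mn) parameter count mentioned above; `dense_matrix` exists only to compare against a dense `np.linalg.slogdet`.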



Algebraic Invariants of Lightning Self-Attention

arXiv.org Machine Learning

We study the polynomial coefficients of lightning self-attention as coordinates of an algebraic variety. We identify linear and nonlinear families of algebraic invariants, including Chow-type, low-rank, Veronese-type, and Sylvester resultant-based constraints.




A Injective Change-of-Variable Formula and Stacking Injective Flows. We first derive (5) from (3). By the chain rule, we have: J[gφ ] g

Neural Information Processing Systems

We summarize our methods for computing/estimating the gradient of the log determinant arising in maximum likelihood training of rectangular flows. Algorithm 2 shows the exact method, where jvp(f, z, v) denotes computing the Jacobian-vector product J[f](z) v using forward-mode AD, and e_i ∈ R^d is the i-th standard basis vector, i.e. a one-hot vector with a 1 on its i-th coordinate. Note that ∂/∂θ log det A_θ is computed using backpropagation. The for loop is easily parallelized in practice.
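The column-by-column construction described here can be sketched as follows. This is not the paper's Algorithm 2. The toy injective map f(z) = tanh(Wz) and its hand-derived `jvp` stand in for a flow differentiated by forward-mode AD, and the choice A = JᵀJ follows the usual injective change-of-variable term rather than anything stated in this excerpt. The sketch computes only the log-determinant value, not its gradient with respect to the parameters.

```python
import numpy as np

def f(W, z):
    """Toy map from R^d to R^D (D > d), assumed here purely for illustration."""
    return np.tanh(W @ z)

def jvp(W, z, v):
    """Hand-derived J[f](z) v for f(z) = tanh(W z).

    A real implementation would obtain this from forward-mode AD instead.
    """
    return (1.0 - np.tanh(W @ z) ** 2) * (W @ v)

def exact_logdet(W, z):
    """Build J one column per standard basis vector, then log det(J^T J)."""
    d = z.size
    # Each iteration is independent, so the loop could run in parallel.
    J = np.stack([jvp(W, z, e_i) for e_i in np.eye(d)], axis=1)
    A = J.T @ J  # d x d, symmetric positive definite when J has full column rank
    _, logdet = np.linalg.slogdet(A)
    return J, logdet

rng = np.random.default_rng(0)
D, d = 5, 2  # toy ambient and latent dimensions
W = rng.standard_normal((D, d))
z = rng.standard_normal(d)
J, logdet = exact_logdet(W, z)
```

The exact method needs d Jacobian-vector products per point, which is why an estimator becomes attractive as the latent dimension d grows.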