 transition matrix


On Group Sufficiency Under Label Bias

Neural Information Processing Systems

Real-world classification datasets often contain label bias, where observed labels differ systematically from the true labels at different rates for different demographic groups. Machine learning models trained on such datasets may then exhibit disparities in predictive performance across these groups. In this work, we characterize the problem of learning fair classification models with respect to the underlying ground-truth labels when given only label-biased data. We focus on the particular fairness definition of group sufficiency, i.e., equal calibration of risk scores across protected groups. We theoretically show that enforcing fairness with respect to label-biased data necessarily results in group miscalibration with respect to the true labels. We then propose a regularizer which minimizes an upper bound on the sufficiency gap by penalizing a conditional mutual information term. Across experiments on eight tabular, image, and text datasets with both synthetic and real label noise, we find that our method reduces the sufficiency gap by up to 7.2% with no significant decrease in overall accuracy.


Universal Sequence Preconditioning

Neural Information Processing Systems

We study the problem of preconditioning in sequential prediction. From the theoretical lens of linear dynamical systems, we show that convolving the target sequence corresponds to applying a polynomial to the hidden transition matrix. Building on this insight, we propose a universal preconditioning method that convolves the target with coefficients from orthogonal polynomials such as Chebyshev or Legendre. We prove that this approach reduces regret for two distinct prediction algorithms and yields the first-ever sublinear and hidden-dimension-independent regret bounds (up to logarithmic factors) that hold for systems with marginally stable and asymmetric transition matrices. Finally, extensive synthetic and real-world experiments show that this simple preconditioning strategy improves the performance of a diverse range of algorithms, including recurrent neural networks, and generalizes to signals beyond linear dynamical systems.
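The correspondence between convolution and a polynomial in the transition matrix can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes a noiseless, input-free linear dynamical system x_{t+1} = A x_t, y_t = C x_t (an assumption made here for illustration) and uses the unnormalized Chebyshev polynomial T_n; all variable names are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(0)
d, T, n = 4, 30, 5  # hidden dimension, sequence length, polynomial degree

# Hypothetical asymmetric transition matrix, rescaled to spectral radius < 1.
A = rng.standard_normal((d, d))
A /= 1.1 * np.max(np.abs(np.linalg.eigvals(A)))
C = rng.standard_normal((1, d))

# Roll out the assumed system: x_{t+1} = A x_t, y_t = C x_t.
xs = [rng.standard_normal(d)]
for _ in range(T - 1):
    xs.append(A @ xs[-1])
xs = np.array(xs)          # shape (T, d)
ys = (xs @ C.T)[:, 0]      # shape (T,)

# Monomial coefficients of Chebyshev T_n, lowest degree first.
c = cheb.cheb2poly([0] * n + [1])

# Convolve the target: conv_all[j] = sum_k c[k] * y_{j+k}.
conv_all = np.convolve(ys, c[::-1], mode="valid")

# Apply the same polynomial to the transition matrix: p(A) = sum_k c[k] A^k.
pA = sum(ck * np.linalg.matrix_power(A, k) for k, ck in enumerate(c))
direct_all = (xs[: T - n] @ pA.T @ C.T)[:, 0]   # C p(A) x_j for each j

print(np.allclose(conv_all, direct_all))
```

Because y_{j+k} = C A^k x_j in this setting, the convolved target equals C p(A) x_j term by term, which is what the final comparison checks.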


Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

Neural Information Processing Systems

Modern state-space models (SSMs) often utilize structured transition matrices which enable efficient computation but pose restrictions on the model's expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost, even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with provably optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs.


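[Figure: hierarchy of state-update types (Bi-linear; Factored/Block Diag. Bi-linear; Complex Diagonal; Real Diagonal; Positive Diagonal) against state tracking tasks (Parity; Arbitrary State Machines; Modular Addition; Abelian Groups), with the annotation "(e.g., Mamba)"]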

Neural Information Processing Systems

The role of hidden units in recurrent neural networks is typically seen as modeling memory, with research focusing on enhancing information retention through gating mechanisms. A less explored perspective views hidden units as active participants in the computation performed by the network, rather than passive memory stores. In this work, we revisit bilinear operations, which involve multiplicative interactions between hidden units and input embeddings. We demonstrate theoretically and empirically that they constitute a natural inductive bias for representing the evolution of hidden states in state tracking tasks. These are the simplest type of tasks that require hidden units to actively contribute to the behavior of the network. We also show that bilinear state updates form a natural hierarchy corresponding to state tracking tasks of increasing complexity, with popular linear recurrent networks such as Mamba residing at the lowest-complexity center of that hierarchy.
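As an illustration of why a multiplicative interaction between hidden units and input embeddings suits state tracking, here is a minimal sketch that hand-constructs a bilinear update tracking addition modulo m. The one-hot encodings, the tensor `W`, and the function `run` are hypothetical choices made for this example; they are not taken from the paper, and nothing here is learned.

```python
import numpy as np

m = 5  # modulus; states and inputs are both one-hot vectors of size m

# Third-order tensor for the bilinear update
#   h_next[k] = sum_{i,j} W[k, i, j] * h[i] * x[j]
# Setting W[(i + j) % m, i, j] = 1 makes the update compute (state + input) mod m.
W = np.zeros((m, m, m))
for i in range(m):
    for j in range(m):
        W[(i + j) % m, i, j] = 1.0

def run(inputs):
    """Track the running sum modulo m with a purely bilinear state update."""
    h = np.eye(m)[0]                      # start in state 0
    for a in inputs:
        x = np.eye(m)[a]                  # one-hot input embedding
        h = np.einsum("kij,i,j->k", W, h, x)
    return int(np.argmax(h))

print(run([3, 4, 1]))  # (3 + 4 + 1) mod 5
```

With one-hot states and inputs, each input selects a permutation of the state, so the hidden vector stays one-hot and no additive term or nonlinearity is needed.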


Universal Sequence Preconditioning

Neural Information Processing Systems

We study the problem of preconditioning in the setting of sequential prediction. From the theoretical lens of linear dynamical systems, we show that applying a convolution to the input sequence translates to applying a polynomial to the unknown transition matrix in the hidden space. With this insight, we develop a novel preconditioning method that convolves the input sequence with the coefficients of the Chebyshev or Legendre polynomials. We formally prove that this improves the regret of a wide family of prediction methods. We proceed to apply this preconditioning technique to the method of spectral filtering. This gives the first sublinear regret bound that is also hidden-dimension free (up to logarithmic factors) even when the hidden transition matrix is asymmetric. From rigorous experiments on synthetic data we show that our simple preconditioning method generalizes to both (1) settings where the data is not from a linear dynamical system and (2) a broad range of learning algorithms, including recurrent neural networks.


Mamba Modulation: On the Length Generalization of Mamba Models

Neural Information Processing Systems

The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behavior of its state-space dynamics, particularly within the parameterization of the state transition matrix $A$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N{\Delta}_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $A$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $A$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating ${\Delta}_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
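The link between the spectrum of $A$ and long-context behavior can be illustrated with a small sketch. It assumes a diagonal $A$ with negative entries and a discretized transition $\bar{A}_t = \exp(\Delta_t A)$, so the initial state is multiplied by $\exp(A \sum_t \Delta_t)$ after all steps. The abstract does not specify the scaling rule, so the `scale` argument below is a hypothetical stand-in, and the function name is invented for this example.

```python
import numpy as np

def retained_fraction(a_diag, deltas, scale=1.0):
    """Per-channel factor multiplying the initial state after all steps,
    assuming diagonal A (negative entries) and A_bar_t = exp(delta_t * A).
    `scale` multiplies the spectrum of A (scalar or per-channel array)."""
    return np.exp(scale * a_diag * np.sum(deltas))

a_diag = np.array([-0.01, -0.1, -1.0])   # hypothetical eigenvalues of A
short_ctx = np.full(100, 0.05)           # step sizes for a short context
long_ctx = np.full(1000, 0.05)           # ten times as many steps

print(retained_fraction(a_diag, short_ctx))
print(retained_fraction(a_diag, long_ctx))             # decays much further
print(retained_fraction(a_diag, long_ctx, scale=0.1))  # matches the short context
```

In this simplified decay term a uniform scalar on $A$ is indistinguishable from rescaling $\Delta_t$; passing a per-channel array as `scale` is one way to sketch the selective modulation the abstract describes, which this toy model does not otherwise capture.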