Goto

Collaborating Authors

 projection


e433e40575f677fb3f7eb7b6b2fb3dd2-Paper-Conference.pdf

Neural Information Processing Systems

We analyze task orderings in continual learning for linear regression, assuming joint realizability of training data. We focus on orderings that greedily maximize dissimilarity between consecutive tasks, a concept briefly explored in prior work but still surrounded by open questions. Using tools from the Kaczmarz method literature, we formalize such orderings and develop geometric and algebraic intuitions around them. Empirically, we demonstrate that greedy orderings converge faster than random ones in terms of the average loss across tasks, both for linear regression with random data and for linear probing on CIFAR-100classification tasks. Analytically, in a high-rank regression setting, we prove a loss bound for greedy orderings analogous to that of random ones. However, under general rank, we establish a repetition-dependent separation. Specifically, while prior work showed that for random orderings, with or without replacement, the average loss after k iterations is bounded by O(1/ k)--we prove that single-pass greedy orderings may fail catastrophically, whereas those allowing repetition converge at rate O(1/ 3 k). Overall, we reveal nuances within and between greedy and random orderings.


1 Appendix 2 AMore Details

Neural Information Processing Systems

Score 0 4 (normal) is most common across cohorts, while score 3 (severe) is rare--especially in PD-GaM 5 and 3DGait, highlighting class imbalance challenges. BMCLab offers a balanced ON/OFF medication split, 7 while E-LC is skewed toward ON-medication. DNE includes healthy, Parkinsonian, and other disease 8 groups for broader contrastive training. Figure A.3 shows label distributions for FoG-related cohorts. This artifact likely stems from the unusual top-down perspective--different from the front15 facing or side views seen in WHAM's training data [1]. While motion encoder-based models may be 16 robust to such distortions, feature-based gait classifiers rely on precise kinematic measurements and 17 thus require carefully corrected input data. To correct this slope artifact, we perform a frame-wise 18 rigid alignment of the reconstructed SMPL skeleton using the Kabsch algorithm [2]. The goal is to 19 rotate each frame so that anatomical directions align with canonical coordinate axes (up, forward), 20 while preserving natural gait structure. This motion 28 vector is then projected onto the ground plane (xz-plane) and used as the walking axis. In frames where the sacrum displacement is less than 30 4mm--indicating near-stationary posture--we fall back on a proxy direction: the cross product of 31 the hip vector (left hip to right hip) and the vertical vector.


Composing Linear Layers from Irreducibles

Neural Information Processing Systems

Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors--geometric objects encoding oriented planes--and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only O log2 d parameters, versus O(d2) required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.


Seeds of Structure: Patch PCAReveals Universal Compositional Cues in Diffusion Models

Neural Information Processing Systems

Diffusion models transform random noise into images of remarkable fidelity, yet the structure of this noise-to-image map remains largely unexplored. We investigate this relationship using patch-wise Principal Component Analysis (PCA) and empirically demonstrate that low-frequency components of the initial noise predominantly influence the compositional structure of generated images. Our analyses reveal that noise seeds inherently contain universal compositional cues, evident when identical seeds produce images with similar structural attributes across different datasets and model architectures. Leveraging these insights, we develop and theoretically justify a simple yet effective Patch PCA denoiser that extracts underlying structure from noise using only generic natural image statistics. The robustness of these structural cues is observed to persist across both pixel-space models and latent diffusion models, highlighting their fundamental nature. Finally, we introduce a zero-shot editing method that enables injecting compositional control over generated images, providing an intuitive approach to guided generation without requiring model fine-tuning or additional training.


Foundations of Top-k Decoding for Language Models

Neural Information Processing Systems

Top-kdecoding is a widely used method for sampling from LLMs: at each token, only the largest k next-token-probabilities are kept, and the next token is sampled after renormalizing them to sum to unity. Top-kand other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-k decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-k decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We introduce Bregman decoders obtained by minimizing a separable Bregman divergence (for both the primal and dual cases) with a sparsity-inducing โ„“0-regularization; in particular, these decoders are adaptive in the sense that the sparsity parameter k is chosen depending on the underlying token distribution. Despite the combinatorial nature of the sparse Bregman objective, we show how to optimize it efficiently for a large class of divergences. We prove that (i) the optimal decoding strategies are greedy, and further that (ii) the objective is discretely convex in k, such that the optimal k can be identified in logarithmic time. We note that standard top-k decoding arises as a special case for the KL divergence, and construct new decoding strategies with substantially different behaviors (e.g., non-linearly up-weighting larger probabilities after renormalization).


Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold

Neural Information Processing Systems

Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, tr(H), whereas we prove that Adam minimizes tr(Diag(H)1/2) instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.


Who Reasons in the Large Language Models?

Neural Information Processing Systems

Despite the impressive performance of large language models (LLMs), the process of endowing them with new capabilities--such as mathematical reasoning-- remains largely empirical and opaque. A critical open question is whether reasoning abilities stem from the entire model, specific modules, or are merely artifacts of overfitting. In this work, we hypothesize that the reasoning capabilities in welltrained LLMs are primarily attributed to the output projection module (o_proj) in the Transformer's multi-head self-attention (MHSA) module. To support this hypothesis, we introduce Stethoscope for Networks (SfN), a suite of diagnostic tools designed to probe and analyze the internal behaviors of LLMs. Using SfN, we provide both circumstantial and empirical evidence suggesting that o_proj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs.


Tensor Product Attention Is All You Need

Neural Information Processing Systems

Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with RoPE and any possible position encoding mechanisms, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducTATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models.


Understanding Contrastive Learning via Gaussian Mixture Models

Neural Information Processing Systems

Contrastive learning involves learning representations via a loss function that encourages each (unlabeled) sample to be far from other samples, but close to its own augmentation. In this paper, we aim to understand why this simple idea performs remarkably well, by theoretically analyzing it for a simple, natural problem setting: dimensionality reduction in Gaussian Mixture Models (GMMs). Note that the standard GMM setup lacks the concept of augmentations. We study an intuitive extension: we define the pair of data sample and its augmentation as a coupled random draw from the GMM such that the marginal over the "noisy" augmentation is biased towards the component of the data sample. For this setup, we show that vanilla contrastive loss, e.g., InfoNCE, is able to find the optimal lower-dimensional subspace even when the Gaussian components are non-isotropic. In particular, we show that InfoNCE can match the performance of a fully supervised algorithm, e.g., LDA, (where each data point is labeled with the mixture component it comes from) even when the augmentations are "noisy". We further extend our setup to the multi-modal case, and develop a GMM-like setting to study the contrastive CLIP loss. We corroborate our theory with experiments on CIFAR100; representations learned by InfoNCE loss match the performance of LDA on clustering metrics.


Enforcing Hard Linear Constraints in Deep Learning Models with Decision Rules

Neural Information Processing Systems

Deep learning models are increasingly deployed in safety-critical tasks where predictions must satisfy hard constraints, such as physical laws, fairness requirements, or safety limits. However, standard architectures lack built-in mechanisms to enforce such constraints, and existing approaches based on regularization or projection are often limited to simple constraints, computationally expensive, or lack feasibility guarantees. This paper proposes a model-agnostic framework for enforcing input-dependent linear equality and inequality constraints on neural network outputs. The architecture combines a task network trained for prediction accuracy with a safe network trained using decision rules from the stochastic and robust optimization literature to ensure feasibility across the entire input space. The final prediction is a convex combination of the two subnetworks, guaranteeing constraint satisfaction during both training and inference without iterative procedures or runtime optimization. We prove that the architecture is a universal approximator of constrained functions and derive computationally tractable formulations based on linear decision rules. Empirical results on benchmark regression tasks show that our method consistently satisfies constraints while maintaining competitive accuracy and low inference latency.