e433e40575f677fb3f7eb7b6b2fb3dd2-Paper-Conference.pdf

Neural Information Processing Systems

We analyze task orderings in continual learning for linear regression, assuming joint realizability of training data. We focus on orderings that greedily maximize dissimilarity between consecutive tasks, a concept briefly explored in prior work but still surrounded by open questions. Using tools from the Kaczmarz method literature, we formalize such orderings and develop geometric and algebraic intuitions around them. Empirically, we demonstrate that greedy orderings converge faster than random ones in terms of the average loss across tasks, both for linear regression with random data and for linear probing on CIFAR-100 classification tasks. Analytically, in a high-rank regression setting, we prove a loss bound for greedy orderings analogous to that of random ones. However, under general rank, we establish a repetition-dependent separation. Specifically, while prior work showed that for random orderings, with or without replacement, the average loss after k iterations is bounded by O(1/√k), we prove that single-pass greedy orderings may fail catastrophically, whereas those allowing repetition converge at rate O(1/∛k). Overall, we reveal nuances within and between greedy and random orderings.
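The setting described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes each task is fitted by a Kaczmarz-style projection onto that task's solution set, and it assumes a "largest current residual" rule as the greedy criterion. The names `fit_task`, `average_loss`, and `run` are hypothetical, and no claim is made here about which ordering ends up with the lower loss.

```python
import numpy as np

def fit_task(w, X, y):
    # Minimum-norm update: project w onto the solution set {v : X v = y}.
    return w + np.linalg.pinv(X) @ (y - X @ w)

def average_loss(w, tasks):
    # Mean squared error averaged over all tasks.
    return float(np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks]))

def run(tasks, d, steps, rule, rng):
    w = np.zeros(d)
    for _ in range(steps):
        if rule == "greedy":
            # Assumed greedy rule: pick the task with the largest residual norm.
            t = int(np.argmax([np.linalg.norm(X @ w - y) for X, y in tasks]))
        else:
            # Random ordering with replacement.
            t = int(rng.integers(len(tasks)))
        w = fit_task(w, *tasks[t])
    return w

rng = np.random.default_rng(0)
d, T, n = 20, 10, 2                      # dimension, number of tasks, rows per task
w_star = rng.standard_normal(d)          # shared solution (joint realizability)
tasks = []
for _ in range(T):
    X = rng.standard_normal((n, d))
    tasks.append((X, X @ w_star))

w_greedy = run(tasks, d, 30, "greedy", rng)
w_random = run(tasks, d, 30, "random", rng)
print(average_loss(w_greedy, tasks), average_loss(w_random, tasks))
```

Because the residual of the task just fitted is zero, this greedy rule never picks the same task twice in a row, but it may revisit tasks later, which corresponds to the "allowing repetition" case.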


Edit Flows: Variable Length Discrete Flow Matching with Sequence-Level Edit Operations

Neural Information Processing Systems

Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations: insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enables flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation.
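To make the three edit operations concrete, here is a small sketch of how they change a token sequence and its length. It illustrates only the operations themselves, not the Markov chain or the training procedure; the helper `apply_edit` and the `("ins" | "del" | "sub", position, token)` tuple format are assumptions made for this example.

```python
def apply_edit(tokens, op):
    """Apply one edit; positions refer to the current sequence."""
    kind, pos, *tok = op
    if kind == "ins":    # insert tok[0] before position pos
        return tokens[:pos] + [tok[0]] + tokens[pos:]
    if kind == "del":    # remove the token at position pos
        return tokens[:pos] + tokens[pos + 1:]
    if kind == "sub":    # replace the token at position pos with tok[0]
        return tokens[:pos] + [tok[0]] + tokens[pos + 1:]
    raise ValueError(f"unknown edit kind: {kind}")

seq = ["the", "cat", "sat"]
for op in [("sub", 1, "dog"), ("ins", 3, "down"), ("del", 0)]:
    seq = apply_edit(seq, op)

print(seq)  # ['dog', 'sat', 'down']
```

Insertions and deletions change the sequence length, which is what a fixed-length, token-wise model cannot express directly.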


Metric Automata Theory: A Unifying Theory of RNNs

Neural Information Processing Systems

We propose Metric Automata Theory, an elegant generalisation of classic Automata Theory to continuous dynamical systems, that constitutes a unifying theory of all kinds of Recurrent Neural Networks (RNNs), including widely-adopted architectures such as xLSTM and State Space Models (SSMs). The theory allows one to analyse RNNs both in the finite and unbounded precision settings seamlessly, while utilising fundamental results of Automata Theory. It also provides a novel notion of robustness that guarantees numerical stability and contributes to stability of learning. We employ the theory to prove a comprehensive set of expressivity results for widely-adopted RNNs, with a focus on robustness and finite-precision. Notably, we contrast the capabilities of xLSTM and SSMs for robustly modelling all star-free regular languages--xLSTM can do so, while SSMs cannot robustly recognize the FLIP-FLOP language.


See Stonehenge's construction like NEVER before: Incredible visual reveals the vast manpower needed to haul the 25-tonne stones into position 5,000 years ago

Daily Mail - Science & tech


Attention Mechanism, Max-Affine Partition, and Universal Approximation

Neural Information Processing Systems

We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the L∞-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under the Lp-norm for 1 ≤ p < ∞. Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
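The partition idea can be illustrated with a one-dimensional toy example. This is a sketch under assumptions, not the paper's construction: the affine scores `c * x - c**2 / 2` (whose argmax is the center nearest to `x`), the sharpness parameter `beta`, and the function name `attention_partition` are all chosen here for illustration. With a large `beta`, the softmax weights are nearly one-hot, so the output is approximately the value assigned to the region containing `x`.

```python
import math

def attention_partition(x, centers, values, beta=1e5):
    # Affine score per "key": maximal for the center closest to x, because
    # c*x - c^2/2 = -(x - c)^2 / 2 + x^2 / 2 and x^2 / 2 is shared by all keys.
    scores = [beta * (c * x - 0.5 * c * c) for c in centers]
    m = max(scores)                            # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    # Softmax-weighted average of the values (nearly one-hot for large beta).
    return sum(w * v for w, v in zip(weights, values)) / z

centers = [i / 10 for i in range(11)]                   # partition of [0, 1]
values = [math.sin(2 * math.pi * c) for c in centers]   # target sampled at centers
approx = attention_partition(0.26, centers, values)
print(approx, math.sin(2 * math.pi * 0.3))
```

Here `x = 0.26` falls in the region of the center `0.3`, so the output is close to the target's value at `0.3`; a finer grid of centers gives a finer piecewise-constant approximation.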


Computational Hardness of Reinforcement Learning with Partial q^π-Realizability

Neural Information Processing Systems

This paper investigates the computational complexity of reinforcement learning within a novel linear function approximation regime, termed partial q^π-realizability. In this framework, the objective is to learn an ϵ-optimal policy with respect to a predefined policy set Π, under the assumption that all value functions corresponding to policies in Π are linearly realizable. This framework adopts assumptions that are weaker than those in the q^π-realizability setting yet stronger than those in the q*-realizability setup. As a result, it provides a more practical model for reinforcement learning scenarios where function approximation naturally arises. We prove that learning an ϵ-optimal policy in this newly defined setting is computationally hard. More specifically, we establish NP-hardness under a parameterized greedy policy set (i.e., argmax) and, further, show that--unless NP = RP--an exponential lower bound (exponential in feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those obtained in the q*-realizability settings, and suggest that computational difficulty persists even when the policy class Π is expanded beyond the optimal policy, reinforcing the unbreakable nature of the computational hardness result regarding partial q^π-realizability under two important policy sets. To establish our negative result, our primary technical contribution is a reduction from two complexity problems, δ-MAX-3SAT and δ-MAX-3SAT(b), to instances of our problem settings: GLINEAR-κ-RL (under the greedy policy set) and SLINEAR-κ-RL (under the softmax policy set), respectively. Our findings indicate that positive computational results are generally unattainable in the context of partial q^π-realizability, in sharp contrast to the q^π-realizability setting under a generative access model.


On Union-Closedness of Language Generation

Neural Information Processing Systems

We investigate language generation in the limit, a model introduced by Kleinberg and Mullainathan [2024, NeurIPS] and extended by Li, Raman, and Tewari [2025]. While Kleinberg and Mullainathan proved generation is possible for all countable collections, [Li et al., 2025] defined a hierarchy of generation notions (uniform, non-uniform, and generatable) and explored their feasibility for uncountable collections. Our first set of results resolves two open questions of [Li et al., 2025] by proving finite unions of generatable or non-uniformly generatable classes need not be generatable. These follow from a stronger result: there is a non-uniformly generatable class and a uniformly generatable class whose union is non-generatable. This adds to the aspects along which language generation in the limit is different from traditional tasks in statistical learning theory like classification, which are closed under finite unions. In particular, it implies that given two generators for different collections, one cannot combine them to obtain a single "more powerful" generator, prohibiting this notion of boosting. Our construction also addresses a third of [Li et al., 2025]'s open questions on whether there are uncountable classes that are non-uniformly generatable and do not satisfy the eventually unbounded closure (EUC) condition introduced by Li, Raman, and Tewari. Our approach utilizes carefully constructed classes along with a novel diagonalization argument that could be of independent interest in the growing area of language generation.


A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

Neural Information Processing Systems

Recent theoretical results show transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer's depth affects its expressive power. We address these questions by analyzing transformers whose depth can grow minimally with context length n. We show even highly uniform transformers with depth Θ(log n) can express two important problems: recognizing regular languages, which captures state tracking abilities and was known to be expressible only by an unconventional, non-uniform model of transformers, and graph connectivity, which underlies multistep reasoning. Notably, both of these problems cannot be expressed by fixed-depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain-of-thought steps. Empirically, our detailed experiments designed to bridge the expressivity vs. learnability gap reveal that our theoretical depth requirements for regular language recognition closely match the practical depth requirements for successfully training transformers. Thus, our results clarify how depth affects a transformer's reasoning capabilities, and provide practical guidance for effective depth selection for sequential reasoning.
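One way to see why logarithmic depth is a natural target for regular language recognition is the standard parallel-composition argument: each input symbol induces a transition function on automaton states, and these functions can be composed pairwise in a balanced tree with about log2(n) rounds. The sketch below illustrates only that intuition and is not the paper's transformer construction; the function names and the parity automaton are assumptions chosen for the example.

```python
def compose(f, g):
    # Apply f first, then g. A function is a tuple: state index -> state index.
    return tuple(g[s] for s in f)

def recognize_log_depth(word, delta, start, accepting):
    # One transition function per input symbol.
    layer = [delta[c] for c in word]
    rounds = 0
    while len(layer) > 1:
        # Compose neighbouring pairs; all pairs in a round are independent.
        nxt = [compose(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])   # odd element carries over unchanged
        layer = nxt
        rounds += 1
    final = layer[0][start] if layer else start
    return final in accepting, rounds

# Parity of 1s: state 0 = even count, state 1 = odd count.
delta = {"0": (0, 1), "1": (1, 0)}
accepted, rounds = recognize_log_depth("1101", delta, 0, {1})
print(accepted, rounds)  # True 2
```

For an input of length n the loop runs ceil(log2(n)) rounds, whereas reading the word left to right takes n sequential steps.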


Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought

Neural Information Processing Systems

Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thought (CoT) techniques that generate "thinking tokens" before answering the questions. While existing theoretical works demonstrate that CoT with discrete tokens boosts the capability of LLMs, recent work on continuous CoT lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks, such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with D steps of continuous CoT can solve the directed graph reachability problem, where D is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoT requires O(n^2) decoding steps, where n is the number of vertices (D < n). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoT must choose a single path sampled from the superposition state, which leads to a sequential search that requires many more steps and may be trapped in local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoT, without explicit supervision to guide the model to explore multiple paths simultaneously.
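The parallel-BFS picture can be sketched without any neural network. In this illustration, a plain Python set stands in for the superposition state: every step expands all reached vertices at once, so reachability within distance D is decided in D steps. This is an assumption-laden analogy, not the paper's transformer construction; `reachable_within` and the example graph are hypothetical.

```python
def reachable_within(edges, source, target, steps):
    # Build adjacency lists for the directed graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    # The set of reached vertices plays the role of the superposition state:
    # all frontiers are kept together rather than committing to one path.
    reached = {source}
    for _ in range(steps):
        reached |= {v for u in reached for v in adj.get(u, ())}
    return target in reached

edges = [(0, 1), (1, 2), (2, 3), (4, 3)]
print(reachable_within(edges, 0, 3, 3))  # True: 0 -> 1 -> 2 -> 3
print(reachable_within(edges, 0, 3, 2))  # False: two steps are not enough
print(reachable_within(edges, 0, 4, 5))  # False: no path from 0 to 4
```

A search that commits to a single sampled path at each step would instead have to backtrack or restart when it picks a dead end, which is the contrast the abstract draws with discrete CoT.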


Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets

Neural Information Processing Systems

We prove that the solution space of 2-layer neural networks with quadratic activation and L2 loss, trained on reasoning tasks over Abelian groups (e.g., modular addition), has a rich algebraic structure. This structure enables the analytical construction of globally optimal solutions from partial solutions that each satisfy only part of the loss, despite the loss's high nonlinearity.