van Dijk, David
Non-Markovian Discrete Diffusion with Causal Language Models
Zhang, Yangtian, He, Sizhuang, Levine, Daniel, Zhao, Lawrence, Zhang, David, Rizvi, Syed A, Zappala, Emanuele, Ying, Rex, van Dijk, David
Discrete diffusion models have emerged as a flexible and controllable paradigm for structured sequence modeling, yet they still lag behind causal language models in expressiveness. To bridge the gap between the two paradigms, we introduce CaDDi, a causal discrete diffusion model that unifies sequential and temporal modeling within a non-Markovian diffusion framework. Unlike conventional diffusion models, which operate step by step with no access to prior states, CaDDi conditions on the full temporal trajectory, enabling more expressive and controllable generation. Our approach also treats causal language models as a special case, allowing pretrained large language models (LLMs) to be adopted for discrete diffusion without architectural modifications. Empirically, we demonstrate that CaDDi outperforms state-of-the-art discrete diffusion models on both natural language and biological sequence tasks, narrowing the gap between diffusion-based methods and large-scale autoregressive transformers.
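As a rough illustration of the core mechanism (a minimal PyTorch sketch; the class and argument names are ours, not the paper's), a causal transformer can serve as the denoiser by reading the concatenated noisy states of the diffusion trajectory rather than only the current state:

import torch
import torch.nn as nn

class CausalDenoiser(nn.Module):
    # Sketch: a causal LM reads the noisy states x_T ... x_t concatenated
    # left-to-right and emits per-position logits for the next, cleaner state.
    def __init__(self, vocab_size, dim=256, n_heads=4, n_layers=4, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, trajectory):  # (B, L) token ids of the concatenated trajectory
        B, L = trajectory.shape
        h = self.tok(trajectory) + self.pos(torch.arange(L, device=trajectory.device))
        causal = torch.triu(torch.full((L, L), float("-inf"), device=h.device), diagonal=1)
        h = self.body(h, mask=causal)  # causal attention over the entire past trajectory
        return self.head(h)

Because the computation is purely causal, a pretrained LLM backbone could in principle stand in for self.body, consistent with the abstract's claim that causal language models are a special case.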
COAST: Intelligent Time-Adaptive Neural Operators
Wu, Zhikai, Zhang, Shiyang, He, Sizhuang, Wang, Sifan, Zhu, Min, Jiao, Anran, Lu, Lu, van Dijk, David
We introduce the Causal Operator with Adaptive Solver Transformer (COAST), a novel neural operator learning method that leverages a causal language model (CLM) framework to dynamically adapt time steps. Our method predicts both the evolution of a system and its optimal time step, intelligently balancing computational efficiency and accuracy. We find that COAST generates variable step sizes that correlate with the intrinsic properties of the underlying systems, both within and across dynamical systems. Within a single trajectory, smaller steps are taken in regions of high complexity, while larger steps are employed in simpler regions. Across different systems, more complex dynamics receive more granular time steps. Benchmarked on diverse systems with varied dynamics, COAST consistently outperforms state-of-the-art methods in both efficiency and accuracy. This work underscores the potential of CLM-based intelligent adaptive solvers for scalable operator learning of dynamical systems.
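A minimal sketch of the two-headed design the abstract describes, with our own naming and a toy MLP backbone standing in for the paper's transformer: the model returns both a next state and a strictly positive time step, so a rollout advances by learned, variable increments.

import torch
import torch.nn as nn

class AdaptiveStepOperator(nn.Module):
    # Sketch: one backbone, two heads -- the next state and the time step
    # used to reach it (Softplus keeps dt positive).
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU())
        self.state_head = nn.Linear(hidden, state_dim)
        self.dt_head = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, state):
        h = self.backbone(state)
        return self.state_head(h), self.dt_head(h)

# Rollout: the model itself decides how far to step at each point,
# taking small steps in complex regions and large steps in simple ones.
model = AdaptiveStepOperator(state_dim=3)
state, t = torch.randn(1, 3), 0.0
with torch.no_grad():
    for _ in range(10):
        state, dt = model(state)
        t += dt.item()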
Intelligence at the Edge of Chaos
Zhang, Shiyang, Patel, Aakash, Rizvi, Syed A, Liu, Nianchen, He, Sizhuang, Karbasi, Amin, Zappala, Emanuele, van Dijk, David
We explore the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict those rules. Our study focuses on elementary cellular automata (ECA), simple yet powerful one-dimensional systems that generate behaviors ranging from trivial to highly complex. By training distinct Large Language Models (LLMs) on different ECAs, we evaluate the relationship between the complexity of a rule's behavior and the intelligence exhibited by the LLM, as reflected in its performance on downstream tasks. Our findings reveal that rules with higher complexity lead to models exhibiting greater intelligence, as demonstrated by their performance on reasoning and chess move prediction tasks. In contrast, uniform and periodic systems, and often highly chaotic systems as well, result in poorer downstream performance, highlighting a sweet spot of complexity conducive to intelligence. We conjecture that intelligence arises from the ability to predict complexity, and that creating intelligence may require only exposure to complexity.
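For concreteness, a short NumPy sketch of the data-generating process: simulating an elementary cellular automaton (here rule 110, a classic complex rule near the edge of chaos). Serializing the rows into token sequences and training the LLMs are omitted.

import numpy as np

def eca_step(state, rule):
    # One synchronous update of a 1-D ECA with periodic boundaries.
    # The Wolfram rule number encodes the lookup table over 3-cell neighborhoods.
    left, right = np.roll(state, 1), np.roll(state, -1)
    neighborhood = 4 * left + 2 * state + right  # binary code 0..7, left cell is MSB
    table = (rule >> np.arange(8)) & 1           # bit i of the rule = output for code i
    return table[neighborhood]

rng = np.random.default_rng(0)
state = rng.integers(0, 2, size=64)
trajectory = [state]
for _ in range(32):
    state = eca_step(state, rule=110)
    trajectory.append(state)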
CaLMFlow: Volterra Flow Matching using Causal Language Models
He, Sizhuang, Levine, Daniel, Vrkic, Ivan, Bressana, Marco Francesco, Zhang, David, Rizvi, Syed Asad, Zhang, Yangtian, Zappala, Emanuele, van Dijk, David
We introduce CaLMFlow (Causal Language Models for Flow Matching), a novel framework that casts flow matching as a Volterra integral equation (VIE), leveraging the power of large language models (LLMs) for continuous data generation. CaLMFlow enables the direct application of LLMs to learn complex flows by formulating flow matching as a sequence modeling task, bridging discrete language modeling and continuous generative modeling. Our method implements tokenization across space and time, thereby solving a VIE over these domains. This approach enables efficient handling of high-dimensional data and outperforms ODE solver-dependent methods like conditional flow matching (CFM). We demonstrate CaLMFlow's effectiveness on synthetic and real-world data, including single-cell perturbation response prediction, showcasing its ability to incorporate textual context and generalize to unseen conditions. Our results highlight LLM-driven flow matching as a promising paradigm in generative modeling, offering improved scalability, flexibility, and context-awareness.

Recent advances in deep learning have revolutionized generative modeling for complex, high-dimensional data. In particular, methods based on ordinary differential equations (ODEs), such as continuous normalizing flows (CNFs) (Chen et al., 2018) and flow matching (Lipman et al., 2022), have emerged as efficient tools for modeling continuous data distributions. However, many ODE systems suffer from stiffness, making them numerically unstable and computationally expensive to solve accurately (Kushnir & Rokhlin, 2012; Zappala et al., 2024). Recent work in operator learning (Xiong et al., 2021; Cao, 2021; Zappala et al., 2024) has also connected solving integral equations (IEs) with transformers, the foundational architecture of LLMs, inspiring the use of LLMs to model dynamical systems through the lens of IEs.
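A rough sketch of the causal-attention reading of a Volterra equation, with our own class name and a toy transformer backbone in place of a pretrained LLM (the paper also tokenizes across space, which this sketch omits): since a Volterra kernel only integrates over s <= t, each predicted point may attend to the entire history of earlier points, so one causal pass replaces an ODE-solver rollout.

import torch
import torch.nn as nn

class CausalFlow(nn.Module):
    # Sketch: tokens are trajectory points over time; causal attention lets the
    # update at time t depend on all earlier times, as in a Volterra integral.
    def __init__(self, dim, width=128, heads=4, layers=3, max_len=256):
        super().__init__()
        self.inp = nn.Linear(dim, width)
        self.pos = nn.Embedding(max_len, width)
        block = nn.TransformerEncoderLayer(width, heads, 4 * width, batch_first=True)
        self.body = nn.TransformerEncoder(block, layers)
        self.out = nn.Linear(width, dim)

    def forward(self, xs):  # xs: (B, T, dim), points x_0 ... x_{T-1}
        B, T, _ = xs.shape
        h = self.inp(xs) + self.pos(torch.arange(T, device=xs.device))
        causal = torch.triu(torch.full((T, T), float("-inf"), device=h.device), diagonal=1)
        return self.out(self.body(h, mask=causal))  # per-prefix prediction of the next point

# Teacher-forced training on toy trajectories: predict x_{t+1} from x_0 ... x_t.
model = CausalFlow(dim=2)
traj = torch.randn(8, 32, 2)
loss = torch.mean((model(traj[:, :-1]) - traj[:, 1:]) ** 2)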
AMPNet: Attention as Message Passing for Graph Neural Networks
Rizvi, Syed Asad, Nguyen, Nhi, Lyu, Haoran, Christensen, Benjamin, Caro, Josue Ortega, Fonseca, Antonio H. O., Zappala, Emanuele, Bagherian, Maryam, Averill, Christopher, Abdallah, Chadi G., Ying, Rex, Brbic, Maria, Dhodapkar, Rahul Madhav, van Dijk, David
Graph Neural Networks (GNNs) have emerged as a powerful representation learning framework for graph-structured data. A key limitation of conventional GNNs is that they represent each node with a single feature vector, potentially overlooking intricate details about individual node features. Here, we propose an Attention-based Message-Passing layer for GNNs (AMPNet) that encodes individual features per node and models feature-level interactions through cross-node attention during message-passing steps. We demonstrate the capabilities of AMPNet through extensive benchmarking on real-world biological systems such as fMRI brain activity recordings and spatial genomic data, improving over existing baselines by 20% on fMRI signal reconstruction, and by a further 8% when positional embeddings are added. Finally, we validate the ability of AMPNet to uncover meaningful feature-level interactions through case studies on biological systems. We anticipate that our architecture will be highly applicable to graph-structured data where node entities encompass rich feature-level information.
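A minimal sketch of the feature-level message-passing idea, assuming each node carries a matrix of per-feature embeddings; the class name and the simple sum aggregation are our choices, not necessarily the paper's.

import torch
import torch.nn as nn

class AMPLayer(nn.Module):
    # Sketch: a message along edge (u -> v) is computed by cross-attention
    # from v's feature embeddings (queries) onto u's (keys/values), so
    # individual features, not whole nodes, interact.
    def __init__(self, d, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x, edge_index):
        # x: (N, F, d) per-node feature embeddings; edge_index: (2, E), rows (src, dst)
        src, dst = edge_index
        msg, _ = self.attn(query=x[dst], key=x[src], value=x[src])  # (E, F, d)
        out = torch.zeros_like(x)
        out.index_add_(0, dst, msg)  # sum incoming messages at each destination node
        return out

layer = AMPLayer(d=32)
x = torch.randn(5, 8, 32)                                # 5 nodes, 8 features each
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])  # 4 directed edges
h = layer(x, edge_index)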
Operator Learning Meets Numerical Analysis: Improving Neural Networks through Iterative Methods
Zappala, Emanuele, Levine, Daniel, He, Sizhuang, Rizvi, Syed, Levy, Sacha, van Dijk, David
Deep neural networks have become essential tools in domains such as computer vision, natural language processing, and physical system simulations, consistently delivering impressive empirical results. However, a deeper theoretical understanding of these networks remains an open challenge. This study seeks to bridge this gap by examining the connections between deep learning and classical numerical analysis. By interpreting neural networks as operators that transform input functions to output functions, discretized on some grid, we establish parallels with numerical methods designed for operator equations. This approach facilitates a new iterative learning framework for neural networks, inspired by established techniques such as Picard iteration. Our findings indicate that certain prominent architectures, including diffusion models, AlphaFold, and Graph Neural Networks (GNNs), inherently utilize iterative operator learning. Empirical evaluations show that adopting a more explicit iterative approach in these models can enhance performance. Building on this, we introduce the Picard Iterative Graph Neural Network (PIGN), an iterative GNN model, and demonstrate its effectiveness in node classification tasks.
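A minimal sketch of Picard-style iteration applied to a toy GNN, with our own naming and a simplified propagation operator (the paper's PIGN may differ in details): the same operator is applied repeatedly, so the forward pass approaches a fixed point of h = h_0 + G(h).

import torch
import torch.nn as nn

class PicardGNN(nn.Module):
    # Sketch: Picard iteration h_{k+1} = h_0 + G(h_k) with a shared-weight
    # graph propagation step G(h) = tanh(A @ (h W)).
    def __init__(self, dim, n_iters=8):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.n_iters = n_iters

    def forward(self, h0, adj):  # h0: (N, dim) node features, adj: (N, N) normalized adjacency
        h = h0
        for _ in range(self.n_iters):
            h = h0 + torch.tanh(adj @ self.lin(h))
        return h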
Continuous Spatiotemporal Transformers
Fonseca, Antonio H. de O., Zappala, Emanuele, Caro, Josue Ortega, van Dijk, David
Modeling spatiotemporal dynamical systems is a fundamental challenge in machine learning. Transformer models have been very successful in NLP and computer vision, where they provide interpretable representations of data. However, a limitation of transformers in modeling continuous dynamical systems is that they are fundamentally discrete-time and discrete-space models, and thus provide no guarantees regarding continuous sampling. To address this challenge, we present the Continuous Spatiotemporal Transformer (CST), a new transformer architecture designed for modeling continuous systems. This framework guarantees a continuous and smooth output via optimization in Sobolev space. We benchmark CST against traditional transformers as well as other spatiotemporal dynamics modeling methods and achieve superior performance on a number of tasks on synthetic and real systems, including learning brain dynamics from calcium imaging data.
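One simple way to realize a Sobolev-space objective is sketched below (our naming, with finite differences standing in for true derivatives): the loss penalizes mismatch in both the signal and its time derivative, which encourages smooth outputs that remain sensible between training samples.

import torch

def sobolev_loss(pred, target, dt, lam=0.1):
    # Sketch of a first-order Sobolev-type loss: L2 error on values plus
    # L2 error on forward-difference time derivatives.
    mse = torch.mean((pred - target) ** 2)
    d_pred = (pred[:, 1:] - pred[:, :-1]) / dt
    d_true = (target[:, 1:] - target[:, :-1]) / dt
    return mse + lam * torch.mean((d_pred - d_true) ** 2)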
Neural Integral Equations
Zappala, Emanuele, Fonseca, Antonio Henrique de Oliveira, Caro, Josue Ortega, van Dijk, David
Integral equations (IEs) are equations that model spatiotemporal systems with non-local interactions. They have found important applications throughout the theoretical and applied sciences, including physics, chemistry, biology, and engineering. While efficient algorithms exist for solving given IEs, no method exists that can learn an IE and its associated dynamics from data alone. In this paper, we introduce Neural Integral Equations (NIE), a method that learns an unknown integral operator from data through an IE solver. We also introduce Attentional Neural Integral Equations (ANIE), in which the integral is replaced by self-attention, which improves scalability and capacity and yields an interpretable model. We demonstrate that (A)NIE outperforms other methods in both speed and accuracy on several benchmark tasks involving ODE, PDE, and IE systems, on both synthetic and real-world data.
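A minimal sketch of the attentional variant (our naming and simplifications): self-attention over the discretized domain plays the role of the integral operator, and the equation y = f + ∫ K y is solved by a short fixed-point iteration.

import torch
import torch.nn as nn

class AttentionalIE(nn.Module):
    # Sketch: solve y = f + integral_term(y) by Picard-style iteration,
    # with self-attention standing in for the integral operator.
    def __init__(self, dim, heads=4, n_iters=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n_iters = n_iters

    def forward(self, f):  # f: (B, T, dim), the free term on the discretized domain
        y = f
        for _ in range(self.n_iters):
            integral, _ = self.attn(y, y, y)  # attention weights act like a learned kernel
            y = f + integral
        return y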
Neural Integro-Differential Equations
Zappala, Emanuele, Fonseca, Antonio Henrique de Oliveira, Moberly, Andrew Henry, Higley, Michael James, Abdallah, Chadi, Cardin, Jessica, van Dijk, David
Modeling continuous dynamical systems from discretely sampled observations is a fundamental problem in data science. Often, such dynamics are the result of non-local processes that involve an integral over time. These systems are therefore modeled with Integro-Differential Equations (IDEs), generalizations of differential equations that comprise both an integral and a differential component. For example, brain dynamics are not accurately modeled by differential equations alone, since their behavior is non-Markovian, i.e., the dynamics are in part dictated by history. Here, we introduce the Neural IDE (NIDE), a novel deep learning framework based on the theory of IDEs in which integral operators are learned using neural networks. We test NIDE on several toy and brain activity datasets and demonstrate that it outperforms other models. These tasks include time extrapolation as well as predicting dynamics from unseen initial conditions, which we test on whole-cortex activity recordings of freely behaving mice. Further, we show that NIDE can decompose dynamics into their Markovian and non-Markovian constituents via the learned integral operator, which we test on fMRI recordings of people on ketamine. Finally, the integrand of the integral operator provides a latent space that gives insight into the underlying dynamics, which we demonstrate on wide-field brain imaging recordings. Altogether, NIDE is a novel approach that enables modeling of complex non-local dynamics with neural networks.
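A rough Euler-discretized sketch of such an update rule, with our own class name and toy MLPs for the two components: F(y) is the Markovian term, while the running integral of K(y) carries the non-Markovian history (zeroing it out recovers a plain neural ODE, which is how a Markovian/non-Markovian decomposition becomes possible).

import torch
import torch.nn as nn

class NeuralIDE(nn.Module):
    # Sketch: integrate dy/dt = F(y(t)) + integral_0^t K(y(s)) ds with Euler steps,
    # accumulating the integral term as a running sum over the trajectory so far.
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.K = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, y0, n_steps=100, dt=0.01):  # y0: (B, dim)
        y, integral, ys = y0, torch.zeros_like(y0), [y0]
        for _ in range(n_steps):
            integral = integral + self.K(y) * dt  # non-Markovian memory term
            y = y + (self.F(y) + integral) * dt   # Euler step
            ys.append(y)
        return torch.stack(ys, dim=1)             # (B, n_steps + 1, dim)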
Permutation invariant networks to learn Wasserstein metrics
Sehanobish, Arijit, Ravindra, Neal, van Dijk, David
Understanding the space of probability measures on a metric space equipped with a Wasserstein distance is one of the fundamental questions in mathematical analysis. The Wasserstein metric has received considerable attention in the machine learning community, especially for its principled way of comparing distributions. In this work, we use a permutation invariant network to map samples from probability measures into a low-dimensional space such that the Euclidean distance between the encoded samples reflects the Wasserstein distance between the underlying probability measures. We show that our network can generalize to correctly compute distances between unseen densities. We also show that these networks can learn the first and second moments of probability distributions.
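A minimal sketch of this setup under our own naming: a DeepSets-style permutation-invariant encoder maps a set of samples to a vector, and training regresses Euclidean distances between encodings onto precomputed Wasserstein distances between the corresponding sample sets.

import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    # Sketch: phi embeds each sample, mean pooling makes the encoding
    # invariant to sample order, and rho maps the pooled vector down.
    def __init__(self, dim, hidden=64, out=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out))

    def forward(self, samples):  # samples: (B, n_points, dim)
        return self.rho(self.phi(samples).mean(dim=1))

def metric_loss(enc_x, enc_y, w_dist):
    # w_dist: (B,) Wasserstein distances precomputed with an exact OT solver.
    return torch.mean((torch.norm(enc_x - enc_y, dim=-1) - w_dist) ** 2)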