temporal structure
StrTransformer: Source-Wise Structured Transformers for Unsupervised Blind Source Recovery
This paper proposes StrTransformer, a source-wise structured Transformer framework for blind source recovery and branch-wise latent modeling. Instead of using an encoder to infer latent variables, StrTransformer directly optimizes the latent source matrix together with an observation-space mixer and source-wise structural Transformer branches. The mixer enforces reconstruction consistency, while each Transformer branch imposes a differentiable structural constraint on one latent source trajectory. Specifically, each source is converted into multi-scale patch tokens, randomly masked, processed by a locality-biased Transformer, and evaluated through a masked patch reconstruction energy. This energy acts as an implicit source-wise structural prior. To encourage different latent branches to specialize into different temporal regimes, StrTransformer further introduces an ordered multi-scale controller that learns branch-specific patch-scale weights, ordered scale centers, and locality attention slopes. The resulting objective combines observation reconstruction, source-wise structural regularization, and modular auxiliary penalties for separation and scale specialization. We analyze the decoupling and coupling structure of the objective, the regularized exact-reconstruction fiber, and the reduction of permutation symmetry induced by ordered branch descriptors. A controlled case study shows that the learned branches converge to distinct temporal-scale structures and recover source-aligned latent trajectories under post-hoc evaluation.
Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis
Linear dimensionality reduction methods are commonly used to extract low-dimensional structure from high-dimensional data. However, popular methods disregard temporal structure, rendering them prone to extracting noise rather than meaningful dynamics when applied to time series data. At the same time, many successful unsupervised learning methods for temporal, sequential and spatial data extract features which are predictive of their surrounding context. Combining these approaches, we introduce Dynamical Components Analysis (DCA), a linear dimensionality reduction method which discovers a subspace of high-dimensional time series data with maximal predictive information, defined as the mutual information between the past and future. We test DCA on synthetic examples and demonstrate its superior ability to extract dynamical structure compared to commonly used linear methods. We also apply DCA to several real-world datasets, showing that the dimensions extracted by DCA are more useful than those extracted by other methods for predicting future states and decoding auxiliary variables.
Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis
How can we tell whether two neural networks utilize the same internal processes for a particular computation? This question is pertinent for multiple subfields of neuroscience and machine learning, including neuroAI, mechanistic interpretability, and brain-machine interfaces. Standard approaches for comparing neural networks focus on the spatial geometry of latent states. Yet in recurrent networks, computations are implemented at the level of dynamics, and two networks performing the same computation with equivalent dynamics need not exhibit the same geometry. To bridge this gap, we introduce a novel similarity metric that compares two systems at the level of their dynamics, called Dynamical Similarity Analysis (DSA).
Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology
Gahtan, Barak, Bronstein, Alex M.
Deep temporal architectures such as TCNs achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap through three contributions: introducing an evaluation methodology for temporal models, revealing surprising empirical phenomena about temporal dependence, and the first architecture-aware theoretical framework for dependent sequences. Fair-Comparison Methodology. We introduce evaluation protocols that fix effective sample size $N_{\text{eff}}$ to isolate temporal structure effects from information content. Empirical Findings. Applying this method reveals that under $N_{\text{eff}} = 2000$, strongly dependent sequences ($ρ= 0.8$) exhibit approx' $76\%$ smaller generalization gaps than weakly dependent ones ($ρ= 0.2$), challenging the conventional view that dependence universally impedes learning. However, observed convergence rates ($N_{\text{eff}}^{-1.21}$ to $N_{\text{eff}}^{-0.89}$) significantly exceed theoretical worst-case predictions ($N^{-0.5}$), revealing that temporal architectures exploit problem structure in ways current theory does not capture. Lastly, we develop the first architecture-aware generalization bounds for deep temporal models on exponentially $β$-mixing sequences. By embedding Golowich et al.'s i.i.d. class bound within a novel blocking scheme that partitions $N$ samples into approx' $B \approx N/\log N$ quasi-independent blocks, we establish polynomial sample complexity under convex Lipschitz losses. The framework achieves $\sqrt{D}$ depth scaling alongside the product of layer-wise norms $R = \prod_{\ell=1}^{D} M^{(\ell)}$, avoiding exponential dependence. While these bounds are conservative, they prove learnability and identify architectural scaling laws, providing worst-case baselines that highlight where future theory must improve.
Priors in Time: Missing Inductive Biases for Language Model Interpretability
Lubana, Ekdeep Singh, Rager, Can, Hindupur, Sai Sumedh R., Costa, Valerie, Tuckute, Greta, Patel, Oam, Murthy, Sonia Krishna, Fel, Thomas, Wurgaft, Daniel, Bigelow, Eric J., Lin, Johnny, Ba, Demba, Wattenberg, Martin, Viegas, Fernanda, Weber, Melanie, Mueller, Aaron
Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.