ar process
A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning
Gaitonde, Jason, Koehler, Frederic, Mossel, Elchanan, Shin, Joonhyung, Sly, Allan
We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the \emph{Ising broadcast process} (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the \emph{coloring broadcast process} (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with \emph{any} valid coloring of the underlying tree. Together these results imply an $Ω(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive \emph{reasoning} model with only $Θ(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
SAMoSSA: Multivariate Singular Spectrum Analysis with Stochastic Autoregressive Noise
The well-established practice of time series analysis involves estimating deterministic, non-stationary trend and seasonality components followed by learning the residual stochastic, stationary components. Recently, it has been shown that one can learn the deterministic non-stationary components accurately using multivariate Singular Spectrum Analysis (mSSA) in the absence of a correlated stationary component; meanwhile, in the absence of deterministic non-stationary components, the Autoregressive (AR) stationary component can also be learnt readily, e.g.
Supplementary Material for ' Causality Preserving Chaotic Transformation and Classification using Neurochaos Learning '
This is the supplementary information pertaining to the main manuscript. In this supplementary material, we provide the comparative performance of Neurochaos Learning with Deep Neural Network, 1DConvolutional Neural Network (1D CNN), and Long Short term Memory (LSTM) for evaluation of cause-effect classification of timeseries data generated from coupled chaotic master-slave system and autoregressive (AR) processes. We also check whether each of these architectures are able to preserve cause-effect relationship between the corresponding features extracted from the original cause and effect time series. To evaluate the efficacy of Neurochaos Learning (NL: ChaosNet) and deep learning algorithms for the classification of cause-effect, we used simulated datasets from (a) coupled autoregressive (AR) processes, and (b) coupled 1D chaotic skew tent-maps in master-slave configuration. The governing equations for the coupled AR processes are the following: M(t)=a1M(t 1)+γr(t), (1) S(t)=a2S(t 1)+ηM(t 1)+γr(t), (2) where M(t) and S(t) are the independent and the dependent (or the cause and effect) time series respectively; a1 = 0.8, a2 = 0.9, the noise intensity γ = 0.03 and r(t) is independent and identically distributed additive Gaussian noise drawn from a standard normal distribution.
Causality Preserving Chaotic Transformation and Classification using Neurochaos Learning
Discovering cause and effect variables from observational data is an important but challenging problem in science and engineering. In this work, a recently proposed brain inspired learning algorithm namely-Neurochaos Learning (NL) is used for the classification of cause and effect time series generated using coupled autoregressive processes, coupled 1D chaotic skew tent maps, coupled 1D chaotic logistic maps and a real-world prey-predator system. In the case of coupled skew tent maps, the proposed method consistently outperforms a five layer Deep Neural Network (DNN) and Long Short Term Memory (LSTM) architecture for unidirectional coupling coefficient values ranging from 0.1 to 0.7. Further, we investigate the preservation of causality in the feature extracted space of NL using Granger Causality for coupled autoregressive processes and Compression-Complexity Causality for coupled chaotic systems and real-world prey-predator dataset. Unlike DNN, LSTM and 1DConvolutional Neural Network, it is found that NL preserves the inherent causal structures present in the input timeseries data. These findings are promising for the theory and applications of causal machine learning and open up the possibility to explore the potential of NL for more sophisticated causal learning tasks.
Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling
Wang, Yongyi, Li, Lingfeng, Chen, Bozhou, Li, Ang, Liu, Hanyu, Zheng, Qirui, Yang, Xionghui, Li, Wenxin
Recent research has developed benchmarks for memory-augmented reinforcement learning (RL) algorithms, providing Partially Observable Markov Decision Process (POMDP) environments where agents depend on past observations to make decisions. While many benchmarks incorporate sufficiently complex real-world problems, they lack controllabil-ity over the degree of challenges posed to memory models. In contrast, synthetic environments enable fine-grained manipulation of dynamics, making them critical for detailed and rigorous evaluation of memory-augmented RL. Our study focuses on POMDP synthesis with three key contributions: 1. A theoretical framework for analyzing POMDPs, grounded in Memory Demand Structure (MDS), transition invariance, and related concepts; 2. A methodology leveraging linear process dynamics, state aggregation, and reward redistribution to construct customized POMDPs with predefined properties; 3. Empirically validated series of POMDP environments with increasing difficulty levels, designed based on our theoretical insights. Our work clarifies the challenges of memory-augmented RL in solving POMDPs, provides guidelines for analyzing and designing POMDP environments, and offers empirical support for selecting memory models in RL tasks.
Identifiable Autoregressive Variational Autoencoders for Nonlinear and Nonstationary Spatio-Temporal Blind Source Separation
Sipilä, Mika, Nordhausen, Klaus, Taskinen, Sara
The modeling and prediction of multivariate spatio-temporal data involve numerous challenges. Dimension reduction methods can significantly simplify this process, provided that they account for the complex dependencies between variables and across time and space. Nonlinear blind source separation has emerged as a promising approach, particularly following recent advances in identifiability results. Building on these developments, we introduce the identifiable autoregressive variational autoen-coder, which ensures the identifiability of latent components consisting of nonstationary autoregressive processes. The blind source separation efficacy of the proposed method is showcased through a simulation study, where it is compared against state-of-the-art methods, and the spatio-temporal prediction performance is evaluated against several competitors on air pollution and weather datasets.
How do Transformers perform In-Context Autoregressive Learning?
Sander, Michael E., Giryes, Raja, Suzuki, Taiji, Blondel, Mathieu, Peyré, Gabriel
Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a simple next token prediction task, where sequences are generated as a first-order autoregressive process $s_{t+1} = W s_t$. We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping. We call the resulting procedure in-context autoregressive learning. More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens. When the tokens are not augmented, we characterize the global minima of a one-layer diagonal linear multi-head Transformer. Importantly, we exhibit orthogonality between heads and show that positional encoding captures trigonometric relations in the data. On the experimental side, we consider the general case of non-commuting orthogonal matrices and generalize our theoretical findings.