Goto

Collaborating Authors

 Minimum Complexity Machines


Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Shaw, Peter, Cohan, James, Eisenstein, Jacob, Toutanova, Kristina

arXiv.org Artificial Intelligence

The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.


Binarized Neural Networks Converge Toward Algorithmic Simplicity: Empirical Support for the Learning-as-Compression Hypothesis

Sakabe, Eduardo Y., Abrahão, Felipe S., Simões, Alexandre, Colombini, Esther, Costa, Paula, Gudwin, Ricardo, Zenil, Hector

arXiv.org Artificial Intelligence

Understanding and controlling the informational complexity of neural networks is a central challenge in machine learning, with implications for generalization, optimization, and model capacity. While most approaches rely on entropy-based loss functions and statistical metrics, these measures often fail to capture deeper, causally relevant algorithmic regularities embedded in network structure. We propose a shift toward algorithmic information theory, using Binarized Neural Networks (BNNs) as a first proxy. Grounded in algorithmic probability (AP) and the universal distribution it defines, our approach characterizes learning dynamics through a formal, causally grounded lens. We apply the Block Decomposition Method (BDM) -- a scalable approximation of algorithmic complexity based on AP -- and demonstrate that it more closely tracks structural changes during training than entropy, consistently exhibiting stronger correlations with training loss across varying model sizes and randomized training runs. These results support the view of training as a process of algorithmic compression, where learning corresponds to the progressive internalization of structured regularities. In doing so, our work offers a principled estimate of learning progression and suggests a framework for complexity-aware learning and regularization, grounded in first principles from information theory, complexity, and computability.


Predictable Compression Failures: Why Language Models Actually Hallucinate

Chlon, Leon, Karim, Ahmed, Chlon, Maggie

arXiv.org Machine Learning

Large language models perform near-Bayesian inference yet violate permutation invariance on exchangeable data. We resolve this by showing transformers minimize expected conditional description length (cross-entropy) over orderings, $\mathbb{E}_π[\ell(Y \mid Γ_π(X))]$, which admits a Kolmogorov-complexity interpretation up to additive constants, rather than the permutation-invariant description length $\ell(Y \mid X)$. This makes them Bayesian in expectation, not in realization. We derive (i) a Quantified Martingale Violation bound showing order-induced deviations scale as $O(\log n)$ with constants; (ii) the Expectation-level Decompression Law linking information budgets to reliability for Bernoulli predicates; and (iii) deployable planners (B2T/RoH/ISR) for answer/abstain decisions. Empirically, permutation dispersion follows $a+b\ln n$ (Qwen2-7B $b \approx 0.377$, Llama-3.1-8B $b \approx 0.147$); permutation mixtures improve ground-truth likelihood/accuracy; and randomized dose-response shows hallucinations drop by $\sim 0.13$ per additional nat. A pre-specified audit with a fixed ISR=1.0 achieves near-0\% hallucinations via calibrated refusal at 24\% abstention. The framework turns hallucinations into predictable compression failures and enables principled information budgeting.


A Minimum Description Length Approach to Regularization in Neural Networks

Abudy, Matan, Well, Orr, Chemla, Emmanuel, Katzir, Roni, Lan, Nur

arXiv.org Artificial Intelligence

State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.


Identifying Causal Direction via Dense Functional Classes

Hlavackova-Schindler, Katerina, Marsela, Suzana

arXiv.org Machine Learning

We address the problem of determining the causal direction between two univariate, continuous-valued variables, X and Y, under the assumption of no hidden confounders. In general, it is not possible to make definitive statements about causality without some assumptions on the underlying model. To distinguish between cause and effect, we propose a bivariate causal score based on the Minimum Description Length (MDL) principle, using functions that possess the density property on a compact real interval. We prove the identifiability of these causal scores under specific conditions. These conditions can be easily tested. Gaussianity of the noise in the causal model equations is not assumed, only that the noise is low. The well-studied class of cubic splines possesses the density property on a compact real interval. We propose LCUBE as an instantiation of the MDL-based causal score utilizing cubic regression splines. LCUBE is an identifiable method that is also interpretable, simple, and very fast. It has only one hyperparameter. Empirical evaluations compared to state-of-the-art methods demonstrate that LCUBE achieves superior precision in terms of AUDRC on the real-world Tuebingen cause-effect pairs dataset. It also shows superior average precision across common 10 benchmark datasets and achieves above average precision on 13 datasets.


Normalized Maximum Likelihood Code-Length on Riemannian Manifold Data Spaces

Fukuzawa, Kota, Suzuki, Atsushi, Yamanishi, Kenji

arXiv.org Artificial Intelligence

--In recent years, with the large-scale expansion of graph data, there has been an increased focus on Riemannian manifold data spaces other than Euclidean space. In particular, the development of hyperbolic spaces has been remarkable, and they have high expressive power for graph data with hierarchical structures. Normalized Maximum Likelihood (NML) is employed in regret minimization and model selection. However, existing formulations of NML have been developed primarily in Euclidean spaces and are inherently dependent on the choice of coordinate systems, making it non-trivial to extend NML to Riemannian manifolds. In this study, we define a new NML that reflects the geometric structure of Riemannian manifolds, called the Riemannian manifold NML (Rm-NML). This Rm-NML is invariant under coordinate transformations and coincides with the conventional NML under the natural parameterization in Euclidean space. We extend existing computational techniques for NML to the setting of Riemannian manifolds. Furthermore, we derive a method to simplify the computation of Rm-NML on Riemannian symmetric spaces, which encompass data spaces of growing interest such as hyperbolic spaces. T o illustrate the practical application of our proposed method, we explicitly computed the Rm-NML for normal distributions on hyperbolic spaces. With the recent increase in the scale of graph data, Riemannian manifold data spaces other than Euclidian spaces are attracting attention as latent spaces suitable for graph embedding [1, 2]. For example, hyperbolic spaces have been demonstrated to possess high expressive power for graph data with hierarchical structures [3]. Spherical spaces are particularly effective in representing graph data with cyclic structures [4]. Notably, research on hyperbolic spaces has been particularly remarkable[3]. Specifically, in the field of representation learning, methods that embed hierarchical structures into hyperbolic space have successfully represented such relationships using significantly lower-dimensional space compared to conventional methods based on Euclidean space, while preserving the essential relational information[2].


drawing connections to Feldman's work (L36), but we agree that the relation between the three topics should be

Neural Information Processing Systems

Thank you all for your thoughtful comments; we address your concerns below. The MDL principle formalizes Occam's razor and is a We will add the discussion of such relevant studies to section 1. We will add these results and accompanying visualizations to appendix. Model (solver) MAC DAFT MAC (euler) DAFT MAC (rk4) DAFT MAC (dopri5; used in training)Time (ms) 153. We found that during evaluation, rk4 solves all the dynamics generated from CLEVR dataset.



A Compression Based Classification Framework Using Symbolic Dynamics of Chaotic Maps

Naik, Parth, B, Harikrishnan N

arXiv.org Artificial Intelligence

We propose a novel classification framework grounded in symbolic dynamics and data compression using chaotic maps. The core idea is to model each class by generating symbolic sequences from thresholded real-valued training data, which are then evolved through a one-dimensional chaotic map. For each class, we compute the transition probabilities of symbolic patterns (e.g., `00', `01', `10', and `11' for the second return map) and aggregate these statistics to form a class-specific probabilistic model. During testing phase, the test data are thresholded and symbolized, and then encoded using the class-wise symbolic statistics via back iteration, a dynamical reconstruction technique. The predicted label corresponds to the class yielding the shortest compressed representation, signifying the most efficient symbolic encoding under its respective chaotic model. This approach fuses concepts from dynamical systems, symbolic representations, and compression-based learning. We evaluate the proposed method: \emph{ChaosComp} on both synthetic and real-world datasets, demonstrating competitive performance compared to traditional machine learning algorithms (e.g., macro F1-scores for the proposed method on Breast Cancer Wisconsin = 0.9531, Seeds = 0.9475, Iris = 0.8469 etc.). Rather than aiming for state-of-the-art performance, the goal of this research is to reinterpret the classification problem through the lens of dynamical systems and compression, which are foundational perspectives in learning theory and information processing.


Single-pass Adaptive Image Tokenization for Minimum Program Search

Duggal, Shivam, Byun, Sanghyun, Freeman, William T., Torralba, Antonio, Isola, Phillip

arXiv.org Artificial Intelligence

According to Algorithmic Information Theory (AIT) -- Intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity -- revealing alignment with human intuition.