Know Your Limits: Entropy Estimation Modeling for Compression and Generalization

Badger, Benjamin L., Neligeorge, Matthew

arXiv.org Artificial Intelligence

Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
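The per-token entropy estimates the abstract describes come down to the ideal code length $-\log_2 p(\text{token} \mid \text{context})$ under a predictive model. A minimal sketch of this idea, using a hypothetical add-one-smoothed bigram model in place of a large causal transformer:

```python
import math
from collections import defaultdict

def bigram_model(tokens):
    """Fit a smoothed bigram next-token model from a token list."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set(tokens)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    def prob(prev, nxt):
        # add-one smoothing over the observed vocabulary
        c = counts[prev]
        return (c[nxt] + 1) / (sum(c.values()) + len(vocab))
    return prob

def per_token_bits(tokens, prob):
    """Ideal code length -log2 p(token | context) at each position."""
    return [-math.log2(prob(p, n)) for p, n in zip(tokens, tokens[1:])]

text = "the cat sat on the mat the cat ate".split()
prob = bigram_model(text)
bits = per_token_bits(text, prob)
print(sum(bits) / len(bits))  # average bits per token = cross-entropy estimate
```

Summing these per-token code lengths gives the compressed size achievable by arithmetic coding under the model; the lower this falls, the closer the model is to the (unknown) entropy of the source.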



Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance

Martin, Fermin Moscoso del Prado

arXiv.org Artificial Intelligence

In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate -- theoretically, and empirically -- that a grammar's derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I introduce the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.
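The link between derivational entropy and MLU can be illustrated with a crude plug-in sketch (this is not SITE; the toy treebank, utterances, and derivation labels below are hypothetical stand-ins for real parses):

```python
import math
from collections import Counter

def mlu(utterances):
    """Mean length of utterance, in tokens."""
    return sum(len(u) for u in utterances) / len(utterances)

def plugin_entropy(labels):
    """Naive plug-in (maximum-likelihood) entropy of a label sample, in bits."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# toy treebank: each utterance is a token list, paired with a
# (hypothetical) derivation label standing in for its parse structure
utterances = [["the", "dog", "barks"], ["dogs", "bark"],
              ["the", "dog", "barks"], ["a", "cat", "sleeps"]]
derivations = ["S->NP VP; NP->Det N", "S->NP VP; NP->N",
               "S->NP VP; NP->Det N", "S->NP VP; NP->Det N"]

H = plugin_entropy(derivations)  # derivational entropy (bits per utterance)
rate = H / mlu(utterances)       # crude per-token derivational entropy rate
```

Dividing the per-utterance entropy by MLU yields a per-token rate, which is the normalization that makes corpora with different utterance lengths comparable; the naive plug-in estimator above is badly biased on small treebanks, which is the problem SITE is designed to address.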


Reviews: Entropy Rate Estimation for Markov Chains with Large State Space

Neural Information Processing Systems

The paper proposes an entropy estimator for Markov chains by reduction to optimal entropy estimation for i.i.d. samples. A sample complexity analysis is provided for different mixing scenarios, with a minimax rate established in a particular regime. The estimator is used to assess the capacity of language models. This is a very clear and well-written paper. I appreciate the authors' efforts to summarize the results.
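The quantity being estimated is the entropy rate $\bar{H} = \sum_i \pi_i H(P_{i\cdot})$, a mixture of per-state conditional entropies. A naive plug-in sketch (the paper's contribution is to replace the inner plug-in entropies with a minimax-optimal i.i.d. entropy estimator, which this sketch does not do):

```python
import math
from collections import Counter, defaultdict

def entropy_rate_plugin(path):
    """Plug-in entropy rate of a Markov chain from one sample path:
    H = sum_i pi_i * H(P_i.), with pi and P estimated empirically."""
    trans = defaultdict(Counter)
    for a, b in zip(path, path[1:]):
        trans[a][b] += 1
    total = len(path) - 1
    H = 0.0
    for a, row in trans.items():
        n_a = sum(row.values())
        pi_a = n_a / total  # empirical visit frequency of state a
        H_row = -sum(c / n_a * math.log2(c / n_a) for c in row.values())
        H += pi_a * H_row
    return H

# a deterministic cycle has zero entropy rate
print(entropy_rate_plugin([0, 1, 2] * 100))  # → 0.0
```

The plug-in version needs roughly one sample per transition-matrix entry to be accurate, which is exactly what becomes infeasible for large state spaces and motivates the reduction studied in the paper.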


Ensemble weighted kernel estimators for multivariate entropy estimation

Neural Information Processing Systems

The problem of estimating entropy functionals of probability densities has received much attention in the information theory, machine learning, and statistics communities. Kernel density plug-in estimators are simple, easy to implement, and widely used for entropy estimation.
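The basic kernel plug-in recipe substitutes a kernel density estimate into $\widehat{H} = -\frac{1}{n}\sum_i \log \widehat{f}_{-i}(x_i)$. A minimal one-dimensional sketch with a Gaussian kernel and leave-one-out evaluation (a single fixed bandwidth, not the weighted ensemble the paper studies):

```python
import math, random

def kde_entropy(sample, bandwidth):
    """Leave-one-out kernel plug-in entropy estimate (nats, 1-D),
    H_hat = -(1/n) * sum_i log f_hat_{-i}(x_i), Gaussian kernel."""
    n = len(sample)
    norm = 1.0 / (math.sqrt(2 * math.pi) * bandwidth)
    H = 0.0
    for i, x in enumerate(sample):
        f = sum(norm * math.exp(-0.5 * ((x - y) / bandwidth) ** 2)
                for j, y in enumerate(sample) if j != i) / (n - 1)
        H -= math.log(f)
    return H / n

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
# true differential entropy of N(0,1) is 0.5*ln(2*pi*e) ≈ 1.419 nats
print(kde_entropy(data, bandwidth=0.4))
```

A single bandwidth trades bias against variance; the ensemble-weighting idea is to combine estimates at several bandwidths so that the leading bias terms cancel, improving the convergence rate in higher dimensions.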


k-Means Maximum Entropy Exploration

Nedergaard, Alexander, Cook, Matthew

arXiv.org Artificial Intelligence

Exploration in high-dimensional, continuous spaces with sparse rewards is an open problem in reinforcement learning. Artificial curiosity algorithms address this by creating rewards that lead to exploration. Given a reinforcement learning algorithm capable of maximizing rewards, the problem reduces to finding an optimization objective consistent with exploration. Maximum entropy exploration uses the entropy of the state visitation distribution as such an objective. However, efficiently estimating the entropy of the state visitation distribution is challenging in high-dimensional, continuous spaces. We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution. The bound relies on a result we prove for non-parametric density estimation in arbitrary dimensions using k-means. We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces, especially on tasks where reinforcement learning algorithms are unable to find rewards.
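A rough feel for the k-means-based bound (this is a simplified stand-in, not the paper's bound): for a discrete state distribution, the entropy of any deterministic coarse-graining, such as nearest-centroid assignment, never exceeds the true entropy, so it gives a loose lower bound that a curiosity reward could maximize.

```python
import math, random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def assignment_entropy(points, centroids):
    """Entropy (bits) of the nearest-centroid assignment distribution:
    a deterministic coarse-graining, hence a (loose) lower bound on the
    entropy of a discrete state-visitation distribution."""
    counts = [0] * len(centroids)
    for p in points:
        counts[min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))] += 1
    n = len(points)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

rng = random.Random(1)
# two well-separated "visitation" blobs; with k=2 the bound approaches 1 bit
states = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)] + \
         [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(100)]
H = assignment_entropy(states, kmeans(states, k=2))
```

Because the centroids are fit by k-means, the coarse-graining adapts to where the agent actually visits, which is what makes the bound usable as an intrinsic reward in continuous spaces; the paper develops a sharper density-based bound on top of this clustering.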


Joint Entropy Search for Multi-objective Bayesian Optimization

Tu, Ben, Gandy, Axel, Kantas, Nikolas, Shafei, Behrang

arXiv.org Artificial Intelligence

Many real-world problems can be phrased as multi-objective optimization problems, where the goal is to identify the best set of compromises between the competing objectives. Multi-objective Bayesian optimization (BO) is a sample-efficient strategy that can be deployed to solve these vector-valued optimization problems when access is limited to a number of noisy objective function evaluations. In this paper, we propose a novel information-theoretic acquisition function for BO called Joint Entropy Search (JES), which considers the joint information gain for the optimal set of inputs and outputs. We present several analytical approximations to the JES acquisition function and also introduce an extension to the batch setting.


Confidence intervals for nonparametric regression

Barrera, David

arXiv.org Machine Learning

We demonstrate and discuss nonasymptotic bounds in probability for the cost of a regression scheme with a general loss function from the perspective of the Rademacher theory, and for the optimality with respect to the average $L^{2}$-distance to the underlying conditional expectations of least squares regression outcomes from the perspective of the Vapnik-Chervonenkis theory. The results follow from an analysis involving independent but possibly nonstationary training samples and can be extended, in a manner that we explain and illustrate, to relevant cases in which the training sample exhibits dependence.
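For orientation, Rademacher-theoretic cost bounds of the kind referred to here typically take the following textbook form (a standard statement for a loss bounded in $[0,1]$ over i.i.d. samples, not the paper's sharper nonstationary version): with probability at least $1-\delta$ over a sample of size $n$,

```latex
R(h) \;\le\; \widehat{R}_n(h) \;+\; 2\,\mathfrak{R}_n(\ell \circ \mathcal{H}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}
\qquad \text{uniformly over } h \in \mathcal{H},
```

where $R$ is the risk, $\widehat{R}_n$ the empirical risk, and $\mathfrak{R}_n(\ell \circ \mathcal{H})$ the Rademacher complexity of the loss class. The paper's contribution is to establish analogues of such bounds without stationarity of the training sample.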


Generalization bounds for nonparametric regression with $\beta-$mixing samples

Barrera, David, Gobet, Emmanuel

arXiv.org Machine Learning

In this paper we present a series of results that allow uniform deviation inequalities for the empirical process to be extended directly from the independent to the dependent case, characterizing the additional error in terms of the $\beta$-mixing coefficients associated with the training sample. We then apply these results to previously obtained inequalities for independent samples, associated with the deviation of the least-squares error in nonparametric regression, to derive corresponding generalization bounds for regression schemes in which the training sample may not be independent. These results provide a framework for analyzing the error of regression schemes whose training sample comes from a large class of $\beta$-mixing sequences, including geometrically ergodic Markov samples, using only the independent case. More generally, they permit a meaningful extension of the Vapnik-Chervonenkis and similar theories from independent training samples to this class of $\beta$-mixing samples.


Entropy from Machine Learning

Janik, Romuald A.

arXiv.org Machine Learning

One can use virtually any machine learning classification algorithm for computing entropy. This procedure can be used to compute the entropy, and consequently the free energy, directly from a set of Monte Carlo configurations at a given temperature. As a test of the proposed method, using an off-the-shelf machine learning classifier we reproduce the entropy and free energy of the 2D Ising model from Monte Carlo configurations at various temperatures throughout its phase diagram. Other potential applications include computing the entropy of spiking neurons or any other multidimensional binary signals. The problem of estimating the entropy of high-dimensional binary configurations or signals is ubiquitous in many disciplines. In physics, we very often have at our disposal a set of configurations of some physical system generated by a Monte Carlo simulation at a given temperature $T_0$. Such data is well suited to computing expectation values of various operators or their correlation functions; obtaining the entropy or free energy of the system, however, is far from trivial. Indeed, to the best of our knowledge, there is no known way to compute the entropy directly from these configurations even for a system of quite moderate size (e.g. a 20×20 lattice). The goal of the present paper is to propose a machine learning based approach to this problem.
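One way such classifier-based entropy estimates can work is through the chain rule, $H(s_1,\dots,s_d) = \sum_i H(s_i \mid s_{<i})$, with each conditional approximated by a predictive model. A simplified sketch (using empirical conditional frequencies over a short context as a stand-in "classifier"; this is an illustration of the chain-rule idea, not the paper's implementation):

```python
import math, random
from collections import Counter

def chain_rule_entropy(configs, context=3):
    """Estimate H(s_1..s_d) = sum_i H(s_i | s_{i-context}..s_{i-1}), in
    bits per configuration, using empirical conditional frequencies as a
    stand-in for a trained classifier predicting each spin."""
    d = len(configs[0])
    n = len(configs)
    H = 0.0
    for i in range(d):
        joint, ctx = Counter(), Counter()
        for c in configs:
            prefix = tuple(c[max(0, i - context):i])
            joint[(prefix, c[i])] += 1
            ctx[prefix] += 1
        H += -sum(cnt / n * math.log2(cnt / ctx[pref])
                  for (pref, _), cnt in joint.items())
    return H

random.seed(0)
# i.i.d. fair spins: entropy should be close to d = 8 bits per configuration
configs = [tuple(random.choice((0, 1)) for _ in range(8)) for _ in range(4000)]
print(chain_rule_entropy(configs))
```

Replacing the frequency tables with a trained classifier is what makes the scheme scale: the classifier can generalize across contexts, so each conditional can be estimated even when the full configuration space is far too large to enumerate.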