Maximum Entropy
Neural computation from first principles: Using the maximum entropy method to obtain an optimal bits-per-joule neuron
Levy, William B, Berger, Toby, Sungkar, Mustafa
Optimization results are one method for understanding neural computation from Nature's perspective and for defining the physical limits on neuron-like engineering. Earlier work looks at individual properties or performance criteria and occasionally a combination of two, such as energy and information. Here we make use of Jaynes' maximum entropy method and combine a larger set of constraints, possibly dimensionally distinct, each expressible as an expectation. The method identifies a likelihood-function and a sufficient statistic arising from each such optimization. This likelihood is a first-hitting time distribution in the exponential class. Particular constraint sets are identified that, from an optimal inference perspective, justify earlier neurocomputational models. Interactions between constraints, mediated through the inferred likelihood, restrict constraint-set parameterizations, e.g., the energy-budget limits estimation performance which, in turn, matches an axonal communication constraint. Such linkages are, for biologists, experimental predictions of the method. In addition to the related likelihood, at least one type of constraint set implies marginal distributions, and in this case, a Shannon bits/joule statement arises.
Latent Laplacian Maximum Entropy Discrimination for Detection of High-Utility Anomalies
Hou, Elizabeth, Sricharan, Kumar, Hero, Alfred O.
Data-driven anomaly detection methods suffer from the drawback of detecting all instances that are statistically rare, irrespective of whether the detected instances have real-world significance or not. In this paper, we are interested in the problem of specifically detecting anomalous instances that are known to have high real-world utility, while ignoring the low-utility statistically anomalous instances. To this end, we propose a novel method called Latent Laplacian Maximum Entropy Discrimination (LatLapMED) as a potential solution. This method uses the EM algorithm to simultaneously incorporate the Geometric Entropy Minimization principle for identifying statistical anomalies, and the Maximum Entropy Discrimination principle to incorporate utility labels, in order to detect high-utility anomalies. We apply our method in both simulated and real datasets to demonstrate that it has superior performance over existing alternatives that independently pre-process with unsupervised anomaly detection algorithms before classifying.
Computing Maximum Entropy Distributions Everywhere
Straszak, Damian, Vishnoi, Nisheeth K.
We study the problem of computing the maximum entropy distribution with a specified expectation over a large discrete domain. Maximum entropy distributions arise and have found numerous applications in economics, machine learning and various sub-disciplines of mathematics and computer science. The key computational questions related to maximum entropy distributions are whether they have succinct descriptions and whether they can be efficiently computed. Here we provide positive answers to both of these questions for very general domains and, importantly, with no restriction on the expectation. This completes the picture left open by the prior work on this problem which requires that the expectation vector is polynomially far in the interior of the convex hull of the domain. As a consequence we obtain a general algorithmic tool and show how it can be applied to derive several old and new results in a unified manner. In particular, our results imply that certain recent continuous optimization formulations, for instance, for discrete counting and optimization problems, the matrix scaling problem, and the worst case Brascamp-Lieb constants in the rank-1 regime, are efficiently computable. Attaining these implications requires reformulating the underlying problem as a version of maximum entropy computation where optimization also involves the expectation vector and, hence, cannot be assumed to be sufficiently deep in the interior. The key new technical ingredient in our work is a polynomial bound on the bit complexity of near-optimal dual solutions to the maximum entropy convex program. This result is obtained by a geometrical reasoning that involves convex analysis and polyhedral geometry, avoiding combinatorial arguments based on the specific structure of the domain. We also provide a lower bound on the bit complexity of near-optimal solutions showing the tightness of our results.
Quantifying multivariate redundancy with maximum entropy decompositions of mutual information
Williams and Beer (2010) proposed a nonnegative mutual information decomposition, based on the construction of redundancy lattices, which allows separating the information that a set of variables contains about a target variable into nonnegative components interpretable as the unique information of some variables not contained in others as well as redundant and synergistic components. However, the definition of multivariate measures of redundancy that comply with nonnegativity and conform to certain axioms that capture conceptually desirable properties of redundancy has proven to be elusive. We here present a procedure to determine multivariate redundancy measures, within the framework of maximum entropy models. In particular, we generalize existing bivariate maximum entropy-based measures of redundancy and unique information, defining measures of the redundant information that a group of variables has about a target, and of the unique redundant information that a group of variables has about a target that is not redundant with information from another group. The two key ingredients for this approach are: First, the identification of a type of constraints on entropy maximization that allows isolating components of redundancy and unique redundancy by mirroring them to synergy components. Second, the construction of rooted tree-based decompositions to breakdown mutual information, ensuring nonnegativity by the local implementation of maximum entropy information projections at each binary unfolding of the tree nodes. Altogether, the proposed measures are nonnegative and conform to the desirable axioms for redundancy measures.
Regression, Logistic Regression and Maximum Entropy
One of the most important tasks in Machine Learning are the Classification tasks (a.k.a. Classification is used to make an accurate prediction of the class of entries in the test set (a dataset of which the entries have not been labelled yet) with the model which was constructed from a training set. You could think of classifying crime in the field of Pre-Policing, classifying patients in the Health sector, classifying houses in the Real-Estate sector. Another field in which classification is big, is Natural Lanuage Processing (NLP). This is the field of science with the goal to makes machines (computers) understand (written) human language.
Maximum Entropy Flow Networks
Loaiza-Ganem, Gabriel, Gao, Yuanjun, Cunningham, John P.
Maximum entropy modeling is a flexible and popular framework for formulating statistical models given partial knowledge. In this paper, rather than the traditional method of optimizing over the continuous density directly, we learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. Doing so is nontrivial in that the objective being maximized (entropy) is a function of the density itself. By exploiting recent developments in normalizing flow networks, we cast the maximum entropy problem into a finite-dimensional constrained optimization, and solve the problem by combining stochastic optimization with the augmented Lagrangian method. Simulation results demonstrate the effectiveness of our method, and applications to finance and computer vision show the flexibility and accuracy of using maximum entropy flow networks.
Dual-Clustering Maximum Entropy with Application to Classification and Word Embedding
Wang, Xiaolong (University of Illinois ) | Wang, Jingjing (University of Illinois) | Zhai, Chengxiang (University of Illinois)
Maximum Entropy (ME), as a general-purpose machine learning model, has been successfully applied to various fields such as text mining and natural language processing. It has been used as a classification technique and recently also applied to learn word embedding. ME establishes a distribution of the exponential form over items (classes/words). When training such a model, learning efficiency is guaranteed by globally updating the entire set of model parameters associated with all items at each training instance. This creates a significant computational challenge when the number of items is large. To achieve learning efficiency with affordable computational cost, we propose an approach named Dual-Clustering Maximum Entropy (DCME). Exploiting the primal-dual form of ME, it conducts clustering in the dual space and approximates each dual distribution by the corresponding cluster center. This naturally enables a hybrid online-offline optimization algorithm whose time complexity per instance only scales as the product of the feature/word vector dimensionality and the cluster number. Experimental studies on text classification and word embedding learning demonstrate that DCME effectively strikes a balance between training speed and model quality, substantially outperforming state-of-the-art methods.
Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods
The well known maximum-entropy principle due to Jaynes, which states that given mean parameters, the maximum entropy distribution matching them is in an exponential family has been very popular in machine learning due to its “Occam’s razor” interpretation. Unfortunately, calculating the potentials in the maximum entropy distribution is intractable [BGS14]. We provide computationally efficient versions of this principle when the mean parameters are pairwise moments: we design distributions that approximately match given pairwise moments, while having entropy which is comparable to the maximum entropy distribution matching those moments. We additionally provide surprising applications of the approximate maximum entropy principle to designing provable variational methods for partition function calculations for Ising models without any assumptions on the potentials of the model. More precisely, we show that we can get approximation guarantees for the log-partition function comparable to those in the low-temperature limit, which is the setting of optimization of quadratic forms over the hypercube. ([AN06])
Maximum entropy models capture melodic styles
Sakellariou, Jason, Tria, Francesca, Loreto, Vittorio, Pachet, François
We introduce a Maximum Entropy model able to capture the statistics of melodies in music. The model can be used to generate new melodies that emulate the style of the musical corpus which was used to train it. Instead of using the $n-$body interactions of $(n-1)-$order Markov models, traditionally used in automatic music generation, we use a $k-$nearest neighbour model with pairwise interactions only. In that way, we keep the number of parameters low and avoid over-fitting problems typical of Markov models. We show that long-range musical phrases don't need to be explicitly enforced using high-order Markov interactions, but can instead emerge from multiple, competing, pairwise interactions. We validate our Maximum Entropy model by contrasting how much the generated sequences capture the style of the original corpus without plagiarizing it. To this end we use a data-compression approach to discriminate the levels of borrowing and innovation featured by the artificial sequences. The results show that our modelling scheme outperforms both fixed-order and variable-order Markov models. This shows that, despite being based only on pairwise interactions, this Maximum Entropy scheme opens the possibility to generate musically sensible alterations of the original phrases, providing a way to generate innovation.
Maximum Entropy Vector Kernels for MIMO system identification
Prando, Giulia, Pillonetto, Gianluigi, Chiuso, Alessandro
Recent contributions have framed linear system identification as a nonparametric regularized inverse problem. Relying on $\ell_2$-type regularization which accounts for the stability and smoothness of the impulse response to be estimated, these approaches have been shown to be competitive w.r.t classical parametric methods. In this paper, adopting Maximum Entropy arguments, we derive a new $\ell_2$ penalty deriving from a vector-valued kernel; to do so we exploit the structure of the Hankel matrix, thus controlling at the same time complexity, measured by the McMillan degree, stability and smoothness of the identified models. As a special case we recover the nuclear norm penalty on the squared block Hankel matrix. In contrast with previous literature on reweighted nuclear norm penalties, our kernel is described by a small number of hyper-parameters, which are iteratively updated through marginal likelihood maximization; constraining the structure of the kernel acts as a (hyper)regularizer which helps controlling the effective degrees of freedom of our estimator. To optimize the marginal likelihood we adapt a Scaled Gradient Projection (SGP) algorithm which is proved to be significantly computationally cheaper than other first and second order off-the-shelf optimization methods. The paper also contains an extensive comparison with many state-of-the-art methods on several Monte-Carlo studies, which confirms the effectiveness of our procedure.