Maximum Entropy
Reviews: Maximum-Entropy Fine-Grained Classification
This paper presents a simple and effective approach for fine-grained image recognition. The core idea is to introduce a maximum-entropy term into the loss function, because regular image classification networks often fail to distinguish semantically close visual classes in the feature space. The formulation is clear and the performance on fine-grained tasks is very good. I like the ablation study on CIFAR-10/100 and on different subsets of ImageNet, which shows that the idea really does help in classifying fine-grained concepts. The major drawback of this paper is its weak technical contribution.
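A minimal sketch of what such a maximum-entropy regularizer might look like, assuming a standard softmax classifier and a hypothetical regularization weight gamma (the paper's exact loss may differ):

```python
# Sketch of maximum-entropy regularization for fine-grained classification.
# Cross-entropy is augmented with the (negated) entropy of the predicted
# distribution, discouraging over-confident predictions on visually
# similar classes. `gamma` is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def max_entropy_loss(logits, targets, gamma=0.1):
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets)                          # standard cross-entropy
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()   # prediction entropy
    return ce - gamma * entropy                                  # maximize entropy by subtracting it
```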
Maximum Entropy Discrimination
We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework.
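For reference, the relative-entropy projection at the heart of this framework can be sketched as follows (notation assumed, not taken from the abstract: P_0 is a prior over parameters \Theta and margins \gamma, \mathcal{L} a discriminant function, and (X_t, y_t) the training examples):

```latex
\min_{P}\; D\!\left(P \,\|\, P_0\right)
\quad \text{subject to} \quad
\int P(\Theta,\gamma)\,\bigl[\, y_t\, \mathcal{L}(X_t \mid \Theta) - \gamma_t \,\bigr]\, d\Theta\, d\gamma \;\ge\; 0,
\qquad t = 1,\dots,T.
```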
A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user's current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic: conditions typical of many recommendation applications. We address sparsity and high dimensionality by first clustering items based on user access patterns, so as to attempt to minimize the a priori probability that recommendations will cross cluster boundaries, and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
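A generic conditional maxent model of this kind can be sketched as follows (assumed notation; the paper's specific time-series and cluster features are not reproduced here), where the f_i are feature functions over a candidate document d and the user's recent navigation history h:

```latex
P(d \mid h) \;=\; \frac{1}{Z(h)}\, \exp\!\Bigl(\sum_i \lambda_i\, f_i(d, h)\Bigr),
\qquad
Z(h) \;=\; \sum_{d'} \exp\!\Bigl(\sum_i \lambda_i\, f_i(d', h)\Bigr).
```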
Mistake Bounds for Maximum Entropy Discrimination
We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds.
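For context, a sketch of the classic Winnow-style multiplicative update that the simplified algorithm resembles (this is not the paper's algorithm; variable names and the threshold choice are assumptions):

```python
# Winnow-style online update: weights start at all ones and are scaled
# multiplicatively only when a mistake is made.
import numpy as np

def winnow_step(w, x, y, eta=1.0, theta=None):
    """One online step; x is a binary feature vector, y is the label in {-1, +1}."""
    if theta is None:
        theta = len(w)                      # a common default threshold
    y_hat = 1 if w @ x >= theta else -1
    if y_hat != y:                          # update only on mistakes
        w = w * np.exp(eta * y * x)         # scale weights of active features
    return w
```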
Correcting sample selection bias in maximum entropy density estimation
We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using samples from the sampling distribution.
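As a rough sketch of the second approach (symbols assumed, not taken from the abstract): if each point is observed with a known selection probability s(x), one can fit a maxent model \hat q with feature functions f(x) to the biased sample and then factor the bias out:

```latex
\hat q(x) \;=\; \frac{1}{Z(\lambda)}\, \exp\!\bigl(\lambda^{\top} f(x)\bigr),
\qquad
\hat p(x) \;\propto\; \frac{\hat q(x)}{s(x)}.
```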
Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Rényi's entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed by replacing kernel PCA with kernel MaxEnt as an intermediate step, which has a major impact on clustering performance.
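The underlying quantity is Rényi's quadratic entropy with a Parzen-window density estimate, which reduces to a sum over kernel evaluations (notation assumed; K is the N x N Gram matrix over the samples):

```latex
H_2(X) \;=\; -\log \int p(x)^2\, dx
\;\approx\; -\log\!\Bigl(\frac{1}{N^2} \sum_{i=1}^{N}\sum_{j=1}^{N} k_{\sigma}(x_i, x_j)\Bigr)
\;=\; -\log\!\Bigl(\frac{1}{N^2}\, \mathbf{1}^{\top} K\, \mathbf{1}\Bigr).
```

Kernel MaxEnt then works with the eigendecomposition of K, which is why it is structurally similar to kernel PCA even though the transformed data it produces may differ substantially.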
Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model that makes this type of analysis feasible for very high-dimensional data: the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high-dimensional measurements of more than a thousand units.
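For context, the pairwise maximum entropy model being referenced is the Ising-type distribution over binary patterns x in {0,1}^n that matches the measured firing rates and pairwise correlations (the NearMaxEnt model itself is a tractable approximation whose exact form is not spelled out here):

```latex
P(x) \;=\; \frac{1}{Z}\, \exp\!\Bigl(\sum_i h_i x_i \;+\; \sum_{i<j} J_{ij}\, x_i x_j\Bigr).
```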
Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models
Training conditional maximum entropy models on massive data requires significant time and computational resources. In this paper, we investigate three common distributed training strategies: distributed gradient, majority voting ensembles, and parameter mixtures. We analyze the worst-case runtime and resource costs of each and present a theoretical foundation for the convergence of parameters under parameter mixtures, the most efficient strategy. We present large-scale experiments comparing the different strategies and demonstrate that parameter mixtures over independent models use fewer resources and achieve loss comparable to standard approaches.
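A minimal sketch of the parameter-mixture strategy described above (function and variable names here are hypothetical): each worker trains a conditional maxent model on its own data shard, and the final model is a weighted average of the per-shard parameter vectors.

```python
# Combine K independently trained weight vectors into a single model.
import numpy as np

def parameter_mixture(per_shard_weights, mixture_coeffs=None):
    K = len(per_shard_weights)
    if mixture_coeffs is None:
        mixture_coeffs = np.full(K, 1.0 / K)          # uniform mixture by default
    return sum(a * w for a, w in zip(mixture_coeffs, per_shard_weights))
```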
A joint maximum-entropy model for binary neural population patterns and continuous signals
Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint second-order statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to.
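A sketch of the exponential-family form implied by constraining joint second-order statistics of binary patterns x and continuous stimuli s (parameter names are illustrative, not taken from the paper):

```latex
p(x, s) \;\propto\;
\exp\!\Bigl( h^{\top} x + \tfrac{1}{2}\, x^{\top} J\, x
  + b^{\top} s - \tfrac{1}{2}\, s^{\top} C\, s
  + x^{\top} W s \Bigr),
```

so that, conditioned on x, the stimulus s is Gaussian, and, conditioned on s, the binary pattern x follows a Boltzmann (Ising-type) distribution.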
How biased are maximum entropy models?
Maximum entropy models have become popular statistical models in neuroscience and other areas in biology, and can be useful tools for obtaining estimates of mutual information in biological systems. However, maximum entropy models fit to small data sets can be subject to sampling bias; i.e. the true entropy of the data can be severely underestimated. Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspecification can lead to much larger bias. We provide a perturbative approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data.
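In symbols, with d model parameters, N observations, and the well-specified case described above, the expected bias of the plug-in entropy estimate is approximately

```latex
\mathbb{E}\bigl[\hat H\bigr] - H \;\approx\; -\,\frac{d}{2N},
```

i.e. the entropy is underestimated by d/(2N); the paper's perturbative result quantifies how much larger the bias can become under misspecification.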