Minimum Complexity Machines
HyperVAE: A Minimum Description Length Variational Hyper-Encoding Network
Nguyen, Phuoc, Tran, Truyen, Gupta, Sunil, Rana, Santu, Dam, Hieu-Chi, Venkatesh, Svetha
We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters \theta is drawn from a distribution p(\theta) which is modeled by a hyper-level VAE. We propose a variational inference using Gaussian mixture models to implicitly encode the parameters \theta into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution q(\theta). HyperVAE can encode the parameters \theta in full in contrast to common hyper-networks practices, which generate only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
Descriptive Dimensionality and Its Characterization of MDL-based Learning and Change Detection
This paper introduces a new notion of dimensionality of probabilistic models from an information-theoretic view point. We call it the "descriptive dimension"(Ddim). We show that Ddim coincides with the number of independent parameters for the parametric class, and can further be extended to real-valued dimensionality when a number of models are mixed. The paper then derives the rate of convergence of the MDL (Minimum Description Length) learning algorithm which outputs a normalized maximum likelihood (NML) distribution with model of the shortest NML codelength. The paper proves that the rate is governed by Ddim. The paper also derives error probabilities of the MDL-based test for multiple model change detection. It proves that they are also governed by Ddim. Through the analysis, we demonstrate that Ddim is an intrinsic quantity which characterizes the performance of the MDL-based learning and change detection.
An MDL-Based Classifier for Transactional Datasets with Application in Malware Detection
Asadi, Behzad, Varadharajan, Vijay
We design a classifier for transactional datasets with application in malware detection. We build the classifier based on the minimum description length (MDL) principle. This involves selecting a model that best compresses the training dataset for each class considering the MDL criterion. To select a model for a dataset, we first use clustering followed by closed frequent pattern mining to extract a subset of closed frequent patterns (CFPs). We show that this method acts as a pattern summarization method to avoid pattern explosion; this is done by giving priority to longer CFPs, and without requiring to extract all CFPs. We then use the MDL criterion to further summarize extracted patterns, and construct a code table of patterns. This code table is considered as the selected model for the compression of the dataset. We evaluate our classifier for the problem of static malware detection in portable executable (PE) files. We consider API calls of PE files as their distinguishing features. The presence-absence of API calls forms a transactional dataset. Using our proposed method, we construct two code tables, one for the benign training dataset, and one for the malware training dataset. Our dataset consists of 19696 benign, and 19696 malware samples, each a binary sequence of size 22761. We compare our classifier with deep neural networks providing us with the state-of-the-art performance. The comparison shows that our classifier performs very close to deep neural networks. We also discuss that our classifier is an interpretable classifier. This provides the motivation to use this type of classifiers where some degree of explanation is required as to why a sample is classified under one class rather than the other class.
Machine Learning, Kolmogorov Complexity, and Squishy Bunnies
We know that Machine Learning is an extremely powerful tool for tackling complex problems which we don't know how to solve by conventional means. Problems like image classification can be solved effectively by Machine Learning because at the end of the day, gathering data for that kind of task is much easier than coming up with hand-written rules for such a complex and difficult problem. But what about problems we already know how to solve? Is there any reason to apply Machine Learning to problems we already have working solutions for? Tasks such as physics simulation, where the rules and equations governing the task are already well known and explored?
Minimum Description Length Revisited
This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of {\em MDL estimators}. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC vs BIC and cross-validation vs Bayes can, to a large extent, be viewed from a unified perspective.
Identifying Linear Models in Multi-Resolution Population Data using Minimum Description Length Principle to Predict Household Income
Amornbunchornvej, Chainarong, Surasvadi, Navaporn, Plangprasopchok, Anon, Thajchayapong, Suttipong
One shirt size cannot fit everybody, while we cannot make a unique shirt that fits perfectly for everyone because of resource limitation. This analogy is true for the policy making. Policy makers cannot establish a single policy to solve all problems for all regions because each region has its own unique issue. In the other extreme, policy makers also cannot create a policy for each small village due to the resource limitation. Would it be better if we can find a set of largest regions such that the population of each region within this set has common issues and we can establish a single policy for them? In this work, we propose a framework using regression analysis and minimum description length (MDL) to find a set of largest areas that have common indicators, which can be used to predict household incomes efficiently. Given a set of household features, and a multi-resolution partition that represents administrative divisions, our framework reports a set C* of largest subdivisions that have a common model for population-income prediction. We formalize a problem of finding C* and propose the algorithm as a solution. We use both simulation datasets as well as a real-world dataset of Thailand's population household information to demonstrate our framework performance and application. The results show that our framework performance is better than the baseline methods. We show the results of our method can be used to find indicators of income prediction for many areas in Thailand. By increasing these indicator values, we expect people in these areas to gain more incomes. Hence, the policy makers can plan to establish the policies by using these indicators in our results as a guideline to solve low-income issues. Our framework can be used to support policy makers to establish policies regarding any other dependent variable beyond incomes in order to combat poverty and other issues.
Quantifying Confounding Bias in Neuroimaging Datasets with Causal Inference
Wachinger, Christian, Becker, Benjamin Gutierrez, Rieckmann, Anna, Pölsterl, Sebastian
Neuroimaging datasets keep growing in size to address increasingly complex medical questions. However, even the largest datasets today alone are too small for training complex machine learning models. A potential solution is to increase sample size by pooling scans from several datasets. In this work, we combine 12,207 MRI scans from 15 studies and show that simple pooling is often ill-advised due to introducing various types of biases in the training data. First, we systematically define these biases. Second, we detect bias by experimentally showing that scans can be correctly assigned to their respective dataset with 73.3% accuracy. Finally, we propose to tell causal from confounding factors by quantifying the extent of confounding and causality in a single dataset using causal inference. We achieve this by finding the simplest graphical model in terms of Kolmogorov complexity. As Kolmogorov complexity is not directly computable, we employ the minimum description length to approximate it. We empirically show that our approach is able to estimate plausible causal relationships from real neuroimaging data.
Inferring Latent dimension of Linear Dynamical System with Minimum Description Length
Time-invariant linear dynamical system arises in many real-world applications,and its usefulness is widely acknowledged. A practical limitation with this model is that its latent dimension that has a large impact on the model capability needs to be manually specified. It can be demonstrated that a lower-order model class could be totally nested into a higher-order class, and the corresponding likelihood is nondecreasing. Hence, criterion built on the likelihood is not appropriate for model selection. This paper addresses the issue and proposes a criterion for linear dynamical system based on the principle of minimum description length. The latent structure, which is omitted in previous work, is explicitly considered in this newly proposed criterion. Our work extends the principle of minimum description length and demonstrates its effectiveness in the tasks of model training. The experiments on both univariate and multivariate sequences confirm the good performance of our newly proposed method.
A Novel Hyperparameter-free Approach to Decision Tree Construction that Avoids Overfitting by Design
Leiva, Rafael Garcia, Anta, Antonio Fernandez, Mancuso, Vincenzo, Casari, Paolo
Decision trees are an extremely popular machine learning technique. Unfortunately, overfitting in decision trees still remains an open issue that sometimes prevents achieving good performance. In this work, we present a novel approach for the construction of decision trees that avoids the overfitting by design, without losing accuracy. A distinctive feature of our algorithm is that it requires neither the optimization of any hyperparameters, nor the use of regularization techniques, thus significantly reducing the decision tree training time. Moreover, our algorithm produces much smaller and shallower trees than traditional algorithms, facilitating the interpretability of the resulting models.
Interpretable multiclass classification by MDL-based rule lists
Proença, Hugo M., van Leeuwen, Matthijs
Interpretable classifiers have recently witnessed an increase in attention from the data mining community because they are inherently easier to understand and explain than their more complex counterparts. Examples of interpretable classification models include decision trees, rule sets, and rule lists. Learning such models often involves optimizing hyperparameters, which typically requires substantial amounts of data and may result in relatively large models. In this paper, we consider the problem of learning compact yet accurate probabilistic rule lists for multiclass classification. Specifically, we propose a novel formalization based on probabilistic rule lists and the minimum description length (MDL) principle. This results in virtually parameter-free model selection that naturally allows to trade-off model complexity with goodness of fit, by which overfitting and the need for hyperparameter tuning are effectively avoided. Finally, we introduce the Classy algorithm, which greedily finds rule lists according to the proposed criterion. We empirically demonstrate that Classy selects small probabilistic rule lists that outperform state-of-the-art classifiers when it comes to the combination of predictive performance and interpretability. We show that Classy is insensitive to its only parameter, i.e., the candidate set, and that compression on the training set correlates with classification performance, validating our MDL-based selection criterion.