Statistical Learning
A Theoretical and Experimental Comparison of the EM and SEM Algorithm
Blömer, Johannes, Bujna, Kathrin, Kuntze, Daniel
In this paper we provide a new analysis of the SEM algorithm. Unlike previous work, we focus on the analysis of a single run of the algorithm. First, we discuss the algorithm for general mixture distributions. Second, we consider Gaussian mixture models and show that with high probability the update equations of the EM algorithm and its stochastic variant are almost the same, given that the input set is sufficiently large. Our experiments confirm that this still holds for a large number of successive update steps. In particular, for Gaussian mixture models, we show that the stochastic variant runs nearly twice as fast.
Relational Logistic Regression
Kazemi, Seyed Mehran (University of British Columbia) | Buchman, David (University of British Columbia) | Kersting, Kristian (Technical University of Dortmund) | Natarajan, Sriraam (Indiana University) | Poole, David (University of British Columbia)
Logistic regression is a commonly used representation for aggregators in Bayesian belief networks when a child has multiple parents. In this paper we consider extending logistic regression to relational models, where we want to model varying populations and interactions among parents. In this paper, we first examine the representational problems caused by population variation. We show how these problems arise even in simple cases with a single parametrized parent, and propose a linear relational logistic regression which we show can represent arbitrary linear (in population size) decision thresholds, whereas the traditional logistic regression cannot. Then we examine representing interactions among the parents of a child node, and representing non-linear dependency on population size. We propose a multi-parent relational logistic regression which can represent interactions among parents and arbitrary polynomial decision thresholds. Finally, we show how other well-known aggregators can be represented using this relational logistic regression.
Mind the Nuisance: Gaussian Process Classification using Privileged Noise
Hernández-Lobato, Daniel, Sharmanska, Viktoriia, Kersting, Kristian, Lampert, Christoph H., Quadrianto, Novi
The learning with privileged information setting has recently attracted a lot of attention within the machine learning community, as it allows the integration of additional knowledge into the training process of a classifier, even when this comes in the form of a data modality that is not available at test time. Here, we show that privileged information can naturally be treated as noise in the latent function of a Gaussian Process classifier (GPC). That is, in contrast to the standard GPC setting, the latent function is not just a nuisance but a feature: it becomes a natural measure of confidence about the training data by modulating the slope of the GPC sigmoid likelihood function. Extensive experiments on public datasets show that the proposed GPC method using privileged noise, called GPC+, improves over a standard GPC without privileged knowledge, and also over the current state-of-the-art SVM-based method, SVM+. Moreover, we show that advanced neural networks and deep learning methods can be compressed as privileged information.
Rates of Convergence for Nearest Neighbor Classification
Chaudhuri, Kamalika, Dasgupta, Sanjoy
Nearest neighbor methods are a popular class of nonparametric estimators with several desirable properties, such as adaptivity to different distance scales in different regions of space. Prior work on convergence rates for nearest neighbor classification has not fully reflected these subtle properties. We analyze the behavior of these estimators in metric spaces and provide finite-sample, distribution-dependent rates of convergence under minimal assumptions. As a by-product, we are able to establish the universal consistency of nearest neighbor in a broader range of data spaces than was previously known. We illustrate our upper and lower bounds by introducing smoothness classes that are customized for nearest neighbor classification.
Direct Density-Derivative Estimation and Its Application in KL-Divergence Approximation
Sasaki, Hiroaki, Noh, Yung-Kyun, Sugiyama, Masashi
Estimation of density derivatives is a versatile tool in statistical data analysis. A naive approach is to first estimate the density and then compute its derivative. However, such a two-step approach does not work well because a good density estimator does not necessarily mean a good density-derivative estimator. In this paper, we give a direct method to approximate the density derivative without estimating the density itself. Our proposed estimator allows analytic and computationally efficient approximation of multi-dimensional high-order density derivatives, with the ability that all hyper-parameters can be chosen objectively by cross-validation. We further show that the proposed density-derivative estimator is useful in improving the accuracy of non-parametric KL-divergence estimation via metric learning. The practical superiority of the proposed method is experimentally demonstrated in change detection and feature selection.
An Efficient Hybrid CS and K-Means Algorithm for the Capacitated PMedian Problem
Mazinan, Hassan Gholami, Ahmadi, Gholam Reza, Khaji, Erfan
The capacitated P-median problem (CPMP) is an NPcomplete problem which investigates the problem of partitioning a set of N nodes into M different disjoint clusters, each represented by a certain node that is designed as concentrator. The NM nodes that are not chosen as concentrators are referred as terminals. The partitioning of the initial N nodes must be performed in such a way that a measure of total distance between the terminals and their corresponding concentrators is minimized. In addition, a capacity constraint imposed on the concentrators must be met, in order to obtain feasible solutions to the problem [1-4]. A direct application of the CPMP is in the context of communication networks deployment, where a set of terminals in the network must be assigned to the corresponding concentrator in order to compose access networks that satisfy the rate requirements of such terminals [5]. In this context, most of the efforts so far has focused on the topological design of communication networks (e.g. Wireless Sensor Networks (WSN), backbone networks or mobile networks [6-8]) since many of the processes involved in such networks can be approached as a CPMP problem, e.g.
Learning Nonlinear Functions Using Regularized Greedy Forest
We consider the problem of learning a forest of nonlinear decision rules with general loss functions. The standard methods employ boosted decision trees such as Adaboost for exponential loss and Friedman's gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a tree learner as a black box, the method we propose directly learns decision forests via fully-corrective regularized greedy search using the underlying forest structure. Our method achieves higher accuracy and smaller models than gradient boosting (and Adaboost with exponential loss) on many datasets.
Generalized Canonical Correlation Analysis for Classification
Shen, Cencheng, Sun, Ming, Tang, Minh, Priebe, Carey E.
It is common to find collections/measurements of related objects, such as the same article in different languages, similar talks given by different presenters, similar weather patterns in different years, etc. It remains to determine how much the available big data helps us in statistical analysis; simply throwing every collected dataset into the mix may not yield an optimal output. Thus it is natural and important to understand theoretically when and how additional datasets improve the performance of various statistical analysis tasks such as regression, clustering, classification, etc. This is our motivation to explore the following classification problem.
Combining predictions from linear models when training and test inputs differ
Methods for combining predictions from different models in a supervised learning setting must somehow estimate/predict the quality of a model's predictions at unknown future inputs. Many of these methods (often implicitly) make the assumption that the test inputs are identical to the training inputs, which is seldom reasonable. By failing to take into account that prediction will generally be harder for test inputs that did not occur in the training set, this leads to the selection of too complex models. Based on a novel, unbiased expression for KL divergence, we propose XAIC and its special case FAIC as versions of AIC intended for prediction that use different degrees of knowledge of the test inputs. Both methods substantially differ from and may outperform all the known versions of AIC even when the training and test inputs are iid, and are especially useful for deterministic inputs and under covariate shift. Our experiments on linear models suggest that if the test and training inputs differ substantially, then XAIC and FAIC predictively outperform AIC, BIC and several other methods including Bayesian model averaging.
On Soft Power Diagrams
Noname manuscript No. (will be inserted by the editor) Abstract Many applications in data analysis begin with a set of points in a Euclidean space that is partitioned into clusters. Common tasks then are to devise a classifier deciding which of the clusters a new point is associated to, finding outliers with respect to the clusters, or identifying the type of clustering used for the partition. One of the common kinds of clusterings are (balanced) least-squares assignments with respect to a given set of sites. For these, there is a'separating power diagram' for which each cluster lies in its own cell. In the present paper, we aim for efficient algorithms for outlier detection and the computation of thresholds that measure how similar a clustering is to a leastsquares assignment for fixed sites. For this purpose, we devise a new model for the computation of a'soft power diagram', which allows a soft separation of the clusters with'point counting properties'; e.g. As our results hold for a more general non-convex model of free sites, we describe it and our proofs in this more general way. Its locally optimal solutions satisfy the aforementioned point counting properties. For our target applications that use fixed sites, our algorithms are efficiently solvable to global optimality by linear programming.