Optimal Time Bounds for Approximate Clustering
Mettu, Ramgopal, Plaxton, Greg
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Omega(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. Thus we establish a tight time bound of Theta(nk) for the k-median problem for a wide range of values of k. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Omega(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
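To make the objective concrete, the following is a minimal sketch of how the k-median cost described above (average distance from each point to its nearest center) could be evaluated. It is not the successive sampling algorithm from the paper; the function name `kmedian_cost`, the use of Euclidean distance, and the toy data are assumptions made for illustration only.

```python
import math

def kmedian_cost(points, centers):
    """Average distance from each point to its nearest center.

    Assumes Euclidean distance; the abstract does not fix a metric.
    Evaluating this naively uses n * k distance computations.
    """
    total = 0.0
    for p in points:
        # Distance from p to the closest of the k centers.
        total += min(math.dist(p, c) for c in centers)
    return total / len(points)

# Hypothetical toy input: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]
one_center = [(5.0, 1.0)]

cost_two = kmedian_cost(points, two_centers)
cost_one = kmedian_cost(points, one_center)
```

With the two centers placed between each pair, every point is at distance 1 from its nearest center, so the cost is lower than with a single central point. Replacing the distance with its square would give the corresponding k-means cost.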
Staged Mixture Modelling and Boosting
Meek, Christopher, Thiesson, Bo, Heckerman, David
In this paper, we introduce and evaluate a data-driven staged mixture modeling technique for building density, regression, and classification models. Our basic approach is to sequentially add components to a finite mixture model using the structural expectation maximization (SEM) algorithm. We show that our technique is qualitatively similar to boosting. This correspondence is a natural byproduct of the fact that we use the SEM algorithm to sequentially fit the mixture model. Finally, in our experimental evaluation, we demonstrate the effectiveness of our approach on a variety of prediction and density estimation tasks using real-world data.
Almost-everywhere algorithmic stability and generalization error
We explore in some detail the notion of algorithmic stability as a viable framework for analyzing the generalization error of learning algorithms. We introduce the new notion of training stability of a learning algorithm and show that, in a general setting, it is sufficient for good bounds on generalization error. In the PAC setting, training stability is both necessary and sufficient for learnability. The approach based on training stability makes no reference to VC dimension or VC entropy. There is no need to prove uniform convergence, and generalization error is bounded directly via an extended McDiarmid inequality. As a result it potentially allows us to deal with a broader class of learning algorithms than Empirical Risk Minimization. We also explore the relationships among VC dimension, generalization error, and various notions of stability. Several examples of learning algorithms are considered.
Dimension Correction for Hierarchical Latent Class Models
Kocka, Tomas, Zhang, Nevin Lianwen
Model complexity is an important factor to consider when selecting among graphical models. When all variables are observed, the complexity of a model can be measured by its standard dimension, i.e. the number of independent parameters. When hidden variables are present, however, standard dimension might no longer be appropriate. One should instead use effective dimension (Geiger et al. 1996). This paper is concerned with the computation of effective dimension. First we present an upper bound on the effective dimension of a latent class (LC) model. This bound is tight and its computation is easy. We then consider a generalization of LC models called hierarchical latent class (HLC) models (Zhang 2002). We show that the effective dimension of an HLC model can be obtained from the effective dimensions of some related LC models. We also demonstrate empirically that using effective dimension in place of standard dimension improves the quality of models learned from data.
An Information-Theoretic External Cluster-Validity Measure
In this paper we propose a measure of clustering quality or accuracy that is appropriate in situations where it is desirable to evaluate a clustering algorithm by somehow comparing the clusters it produces with "ground truth" consisting of classes assigned to the patterns by manual means or some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also has the characteristic of allowing clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels of the patterns are as predictors of their class labels. In cases where all clusterings to be compared have the same number of clusters, the measure is equivalent to the mutual information between the cluster labels and the class labels. In cases where the numbers of clusters are different, however, it computes the reduction in the number of bits that would be required to encode (compress) the class labels if both the encoder and decoder have free access to the cluster labels. To achieve this encoding the estimated conditional probabilities of the class labels given the cluster labels must also be encoded. These estimated probabilities can be seen as a model for the class labels and their associated code length as a model cost.
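For the equal-number-of-clusters case, the quantity mentioned above is the empirical mutual information between cluster labels and class labels. The sketch below estimates it in bits from label counts. The function name `mutual_information_bits` and the toy labels are illustrative assumptions, and the model-cost term used when cluster counts differ is not specified in the abstract, so it is deliberately omitted.

```python
import math
from collections import Counter

def mutual_information_bits(cluster_labels, class_labels):
    """Empirical mutual information I(cluster; class) in bits.

    Probabilities are estimated by relative frequencies. This covers only
    the mutual-information part of the measure, not any model cost.
    """
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))
    cluster_counts = Counter(cluster_labels)
    class_counts = Counter(class_labels)
    mi = 0.0
    for (c, y), n_cy in joint.items():
        # p(c, y) * log2( p(c, y) / (p(c) * p(y)) ), written with counts.
        mi += (n_cy / n) * math.log2(n_cy * n / (cluster_counts[c] * class_counts[y]))
    return mi

# Hypothetical labels: four patterns, two balanced classes.
classes = ["a", "a", "b", "b"]
mi_perfect = mutual_information_bits([0, 0, 1, 1], classes)
mi_useless = mutual_information_bits([0, 1, 0, 1], classes)
```

A clustering that reproduces the two balanced classes exactly carries one full bit about the class label, while a clustering that is independent of the classes carries none.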
Learning with Scope, with Application to Information Extraction and Classification
Blei, David, Bagnell, J Andrew, McCallum, Andrew
In probabilistic approaches to classification and information extraction, one typically builds a statistical model of words under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data. For example, in information extraction from web pages, word formatting may be indicative of extraction category in different ways on different web pages. The difficulty with using such features is capturing and exploiting the new regularities encountered in previously unseen data. In this paper, we propose a hierarchical probabilistic model that uses both local/scope-limited features, such as word formatting, and global features, such as word content. The local regularities are modeled as an unobserved random parameter which is drawn once for each local data set. This random parameter is estimated during the inference process and then used to perform classification with both the local and global features, a procedure which is akin to automatically retuning the classifier to the local regularities on each newly encountered web page. Exact inference is intractable and we present approximations via point estimates and variational methods. Empirical results on large collections of web data demonstrate that this method significantly improves performance over traditional models of global features alone.
Tree-dependent Component Analysis
Bach, Francis R., Jordan, Michael I.
We present a generalization of independent component analysis (ICA), where instead of looking for a linear transform that makes the data components independent, we look for a transform that makes the data components well fit by a tree-structured graphical model. Treating the problem as a semiparametric statistical problem, we show that the optimal transform is found by minimizing a contrast function based on mutual information, a function that directly extends the contrast function used for classical ICA. We provide two approximations of this contrast function, one using kernel density estimation, and another using kernel generalized variance. This tree-dependent component analysis framework leads naturally to an efficient general multivariate density estimation technique where only bivariate density estimation needs to be performed.
Learning Hierarchical Object Maps Of Non-Stationary Environments with mobile robots
Anguelov, Dragomir, Biswas, Rahul, Koller, Daphne, Limketkai, Benson, Thrun, Sebastian
Building models, or maps, of robot environments is a highly active research area; however, most existing techniques construct unstructured maps and assume static environments. In this paper, we present an algorithm for learning object models of non-stationary objects found in office-type environments. Our algorithm exploits the fact that many objects found in office environments look alike (e.g., chairs, recycling bins). It does so through a two-level hierarchical representation, which links individual objects with generic shape templates of object classes. We derive an approximate EM algorithm for learning shape parameters at both levels of the hierarchy, using local occupancy grid maps for representing shape. Additionally, we develop a Bayesian model selection algorithm that enables the robot to estimate the total number of objects and object templates in the environment. Experimental results using a real robot equipped with a laser range finder indicate that our approach performs well at learning object-based maps of simple office environments. The approach outperforms a previously developed non-hierarchical algorithm that models objects but lacks class templates.
Speed Optimization In Unplanned Traffic Using Bio-Inspired Computing And Population Knowledge Base
Ghosal, Prasun, Chakraborty, Arijit, Banerjee, Sabyasachee, Barman, Satabdi
Applying bio-inspired algorithms to road traffic congestion and safety is a very promising research problem. Finding an efficient optimization method that increases the degree of speed optimization, and thereby the traffic flow, in an unplanned zone is an issue of wide concern. However, there has been limited research effort on optimizing lane usage together with speed. The main objective of this article is to find novel techniques that solve the problem optimally using knowledge from the analysis of vehicle speeds, which in turn will guide the design of lanes to provide better optimized traffic. The accident factors adjust the base model estimates for individual geometric design element dimensions and for traffic control features. These algorithms are applied, in partially modified form in accordance with this novel speed optimization technique for unplanned traffic analysis, to the proposed design and speed optimization plan. The experimental results based on real-life data are quite encouraging.
Understanding (dis)similarity measures
From a psychological point of view, a human being uses the notions of similarity and dissimilarity for problem solving, inductive reasoning, element categorization, or simply to search for information partially matching specific criteria. The ability to assess similarities between a newly given pattern and already known patterns is a distinctive feature of human thinking. It is therefore not surprising that similarity and its dual concept, dissimilarity, are a fundamental part of many theories and applications in several fields within or related to Artificial Intelligence, such as Case-Based Reasoning [1], Data Mining [2], Information Retrieval [3], Pattern Matching [4] or Neural Networks such as the Radial Basis Function network [5]. Many applications are characterized by the use of metrics for measuring differences between objects. Metric dissimilarities have been studied in depth, but they are tied to a particular transitivity expression based on the triangle inequality. Very often metric (distance) functions are used due to our natural understanding of Euclidean spaces. However, not all metrics are Euclidean and many interesting dissimilarities are non-metric. In a general sense, similarity and dissimilarity express a dual comparison between two elements. We argue that every property of a similarity should have a correspondence with one property of a dissimilarity and vice versa. This duality is commonly ignored, as well as some annoying properties (e.g.