Uncharted Forest: a Technique for Exploratory Data Analysis of Provenance Studies

arXiv.org Machine Learning

Exploratory data analysis is a crucial task for developing effective classification models from high-dimensional datasets. We explore the utility of a new unsupervised tree ensemble, which we call uncharted forest, for elucidating class associations, sample-sample associations, class heterogeneity, and uninformative classes in provenance studies. Uncharted forest partitions data along random variables which offer the most gain under various gain metrics, namely variance. After each tree is grown, a tally of every terminal node's sample membership is constructed so that a probabilistic measure of each pair of samples being partitioned together can be stored in a single matrix. That matrix may be readily viewed as a heat map, and the probabilities can be quantified via metrics which account for class or cluster membership. We demonstrate the advantages and limitations of this technique by applying it to one exemplary dataset and three provenance study datasets. The method is also validated by comparing the sample association metrics to clustering algorithms with known variance-based clustering mechanisms.
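
The tallying step described in this abstract can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not state: each node picks one variable at random, splits at the threshold that most reduces that variable's sum of squared deviations, and stops at a fixed depth or minimum node size. The names `co_occurrence_matrix`, `max_depth`, and `min_size` are hypothetical, and this is not the authors' implementation.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (variance times n)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, idx, feature):
    """Threshold on one feature that most reduces that feature's SSE, or None."""
    column = [rows[i][feature] for i in idx]
    values = sorted(set(column))
    parent = sse(column)
    best_gain, best_thr = 0.0, None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [v for v in column if v <= thr]
        right = [v for v in column if v > thr]
        gain = parent - sse(left) - sse(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr

def grow_leaves(rows, idx, depth, max_depth, min_size, rng):
    """Recursively partition sample indices; return the terminal nodes."""
    if depth >= max_depth or len(idx) < 2 * min_size:
        return [idx]
    feature = rng.randrange(len(rows[0]))  # assumption: one random variable per node
    thr = best_split(rows, idx, feature)
    if thr is None:
        return [idx]
    left = [i for i in idx if rows[i][feature] <= thr]
    right = [i for i in idx if rows[i][feature] > thr]
    if len(left) < min_size or len(right) < min_size:
        return [idx]
    return (grow_leaves(rows, left, depth + 1, max_depth, min_size, rng)
            + grow_leaves(rows, right, depth + 1, max_depth, min_size, rng))

def co_occurrence_matrix(rows, n_trees=200, max_depth=3, min_size=2, seed=0):
    """Fraction of trees in which each pair of samples shares a terminal node."""
    rng = random.Random(seed)
    n = len(rows)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        for leaf in grow_leaves(rows, list(range(n)), 0, max_depth, min_size, rng):
            for i in leaf:
                for j in leaf:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]

# Toy data (made up for illustration): two well-separated groups of four samples.
toy_rows = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.1, 0.2),
            (5.0, 5.1), (5.1, 5.0), (5.2, 5.2), (5.1, 5.2)]
toy_matrix = co_occurrence_matrix(toy_rows, n_trees=50)
for row in toy_matrix:
    print(" ".join(f"{p:.2f}" for p in row))
```

Each printed row is one row of the matrix that the abstract proposes viewing as a heat map. Samples that rarely share a terminal node get values near zero, and samples that usually do get values near one.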

Text Classification by Labeling Words

AAAI Conferences

Traditionally, text classifiers are built from labeled training examples. Labeling is usually done manually by human experts (or the users), which is a labor-intensive and time-consuming process. In the past few years, researchers have investigated various forms of semi-supervised learning to reduce the burden of manual labeling. In this paper, we propose a different approach. Instead of labeling a set of documents, the proposed method labels a set of representative words for each class.

Centroid Networks for Few-Shot Clustering and Unsupervised Few-Shot Classification

arXiv.org Machine Learning

Traditional clustering algorithms such as K-means rely heavily on the nature of the chosen metric or data representation. To get meaningful clusters, these representations need to be tailored to the downstream task (e.g. cluster photos by object category, cluster faces by identity). Therefore, we frame clustering as a meta-learning task, few-shot clustering, which allows us to specify how to cluster the data at the meta-training level, despite the clustering algorithm itself being unsupervised. We propose Centroid Networks, a simple and efficient few-shot clustering method based on learning representations which are tailored both to the task to solve and to its internal clustering module. We also introduce unsupervised few-shot classification, which is conceptually similar to few-shot clustering, but is strictly harder than supervised few-shot classification and therefore allows direct comparison with existing supervised few-shot classification methods. On Omniglot and miniImageNet, our method achieves accuracy competitive with popular supervised few-shot classification algorithms, despite using *no labels* from the support set. We also show performance competitive with state-of-the-art learning-to-cluster methods.

Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Neural Information Processing Systems

Little work has been done to directly combine the outputs of multiple supervised and unsupervised models. However, doing so can increase the accuracy and applicability of ensemble methods. First, we can boost the diversity of a classification ensemble by incorporating multiple clustering outputs, each of which provides grouping constraints for the joint label predictions of a set of related objects. Second, ensembles of supervised models are limited in applications which have no access to raw data but only to the meta-level model outputs. In this paper, we aim at calculating a consolidated classification solution for a set of objects by maximizing the consensus among both supervised predictions and unsupervised grouping constraints. We seek a globally optimal label assignment for the target objects, which is different from the result of traditional majority voting and model combination approaches. We cast the problem as an optimization problem on a bipartite graph, where the objective function favors smoothness in the conditional probability estimates over the graph and penalizes deviation from the initial labeling of the supervised models. We solve the problem through iterative propagation of conditional probability estimates among neighboring nodes, and interpret the method as conducting a constrained embedding in a transformed space, as well as a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives.
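
The iterative propagation described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact formulation: one side of the bipartite graph holds objects and the other holds groups (a predicted class from a classifier, or a cluster from a clustering model), each group estimate is an average of its members' estimates pulled toward the classifier's label with a hypothetical weight `alpha`, and each object estimate is the average over the groups it belongs to. The function name `consensus_labels` and all parameters are illustrative.

```python
def consensus_labels(groups, group_labels, n_objects, n_classes,
                     alpha=2.0, n_iter=50):
    """Propagate class-probability estimates between objects and groups.

    groups:       list of lists of object indices, one list per group node
    group_labels: class index for a group produced by a classifier,
                  None for a group produced by a clustering model
    Returns a list of per-object probability vectors.
    """
    member_of = [[] for _ in range(n_objects)]
    for g, members in enumerate(groups):
        for i in members:
            member_of[i].append(g)
    if any(not gs for gs in member_of) or any(not m for m in groups):
        raise ValueError("every object needs a group and every group a member")

    # Start from uniform estimates on both sides of the graph.
    u = [[1.0 / n_classes] * n_classes for _ in range(n_objects)]
    q = [[1.0 / n_classes] * n_classes for _ in groups]

    for _ in range(n_iter):
        # Group update: average of members, pulled toward the initial
        # supervised label (clusters have no label, so no pull).
        for g, members in enumerate(groups):
            label = group_labels[g]
            penalty = alpha if label is not None else 0.0
            for c in range(n_classes):
                total = sum(u[i][c] for i in members)
                if label == c:
                    total += alpha
                q[g][c] = total / (len(members) + penalty)
        # Object update: average over the groups the object belongs to.
        for i in range(n_objects):
            for c in range(n_classes):
                u[i][c] = sum(q[g][c] for g in member_of[i]) / len(member_of[i])
    return u

# Toy input (made up): two classifiers and one clustering over four objects.
# Classifier A predicts [0, 0, 1, 1]; classifier B predicts [0, 1, 1, 1];
# the clustering puts objects {0, 1} and {2, 3} together.
toy_groups = [[0, 1], [2, 3], [0], [1, 2, 3], [0, 1], [2, 3]]
toy_group_labels = [0, 1, 0, 1, None, None]
toy_estimates = consensus_labels(toy_groups, toy_group_labels,
                                 n_objects=4, n_classes=2)
toy_predictions = [max(range(2), key=lambda c: row[c]) for row in toy_estimates]
print(toy_predictions)
```

In the toy input the two classifiers disagree on object 1, so a majority vote is tied. The clustering groups object 1 with object 0, which gives the propagation a grouping constraint to draw on when resolving the disagreement.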

Modeling Skewed Class Distributions by Reshaping the Concept Space

AAAI Conferences

We introduce an approach to learning from imbalanced class distributions that does not change the underlying data distribution. The ICC algorithm decomposes majority classes into smaller sub-classes that create a more balanced class distribution. In this paper, we explain how ICC can not only address the class imbalance problem but may also increase the expressive power of the hypothesis space. We validate ICC and analyze alternative decomposition methods on well-known machine learning datasets as well as new problems in pervasive computing. Our results indicate that ICC performs as well as or better than existing approaches to handling class imbalance.
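
The decomposition idea in this abstract can be sketched as follows. The abstract does not say how majority classes are split, so this example assumes a plain k-means clustering of the majority class, with the number of sub-classes set to the majority-to-minority size ratio. The helper names (`kmeans`, `decompose_majority`) and the toy data are hypothetical, and the code is an illustration rather than the ICC algorithm itself.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means with farthest-first initialization.

    Assumes k does not exceed the number of distinct points.
    """
    centers = [tuple(points[0])]
    while len(centers) < k:
        centers.append(tuple(max(
            points, key=lambda p: min(dist2(p, c) for c in centers))))
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    # Final assignment against the final centers.
    return [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]

def decompose_majority(X, y, k=None):
    """Relabel the largest class as (class, sub_class) pairs.

    Other classes keep a single sub-class 0. The data points themselves
    are left untouched, so the underlying distribution does not change.
    """
    counts = Counter(y)
    majority, n_major = counts.most_common(1)[0]
    if k is None:
        k = max(1, round(n_major / min(counts.values())))
    idx = [i for i, label in enumerate(y) if label == majority]
    assign = kmeans([X[i] for i in idx], k)
    new_y = [(label, 0) for label in y]
    for i, a in zip(idx, assign):
        new_y[i] = (majority, a)
    return new_y

# Toy data (made up): six "neg" samples in two clumps, three "pos" samples.
toy_X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
         (5, 5), (5, 6), (6, 5)]
toy_y = ["neg"] * 6 + ["pos"] * 3
toy_sub_y = decompose_majority(toy_X, toy_y)
print(Counter(toy_y))
print(Counter(toy_sub_y))
# Predictions over sub-classes map back by dropping the sub-class index.
toy_restored = [label for label, _ in toy_sub_y]
```

A classifier would then be trained on the sub-class labels, and its predictions collapsed to the original classes by discarding the sub-class index, as the last line does for the training labels.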