Clustering
Probabilistic Combination of Classifier and Cluster Ensembles for Non-transductive Learning
Acharya, Ayan, Hruschka, Eduardo R., Ghosh, Joydeep, Sarwar, Badrul, Ruvini, Jean-David
Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place. This paper describes a Bayesian framework that takes as input class labels from existing classifiers (designed based on labeled data from the source domain), as well as cluster labels from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data. We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.
Block Modeling in Large Social Networks with Many Clusters
Biesan, Shawn (Baldwin Wallace University) | Anthony, Adam (Baldwin Wallace University) | desJardins, Marie (University of Maryland Baltimore County)
In this paper, we present an optimized version of the previously developed Block Modularity algorithm (Anthony, 2009). The original algorithm was a fast, greedy method that effectively discovered a structured clustering in linked data and scaled very well with the number of nodes and edges. The optimized version also scales with model complexity; the technique can now be used effectively to discover thousands of clusters in data sets with hundreds of thousands (and possibly more) nodes and edges. The optimization improves the runtime per iteration from cubic to quadratic with a small increase in the constant factor. The algorithm compares favorably with Karrer and Newman's Degree-Corrected Block Model (DCBM) in both runtime and quality of results.
Discovering Protein Clusters
Epstein, Susan (Hunter College and The Graduate Center of The City University of New York) | Li, Xingjian (Microsoft Online Services Division) | Valdez, Peter (Hunter College of The City University of New York) | Grayevsky, Sofia (Hunter College of The City University of New York) | Osisek, Eric (The Graduate Center of The City University of New York) | Yun, Xi (The Graduate Center of The City University of New York) | Xie, Lei (Hunter College of The City University of New York)
As biological data about genes and their interactions proliferates, scientists have the opportunity to identify sets of proteins whose interactions make them worthy of further investigation. This paper reports on a knowledge discovery technique to support that work. Foretell is an algorithm originally designed to support search for solutions to constraint satisfaction problems. Recent adaptations enable Foretell to detect sets of genes that interact heavily with one another. We provide empirical results, and describe ongoing work on biological meaning and knowledge infusion from the user.
Transforming Graph Data for Statistical Relational Learning
Rossi, R. A., McDowell, L. K., Aha, D. W., Neville, J.
Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of Statistical Relational Learning (SRL) algorithms to these domains. In this article, we examine and categorize techniques for transforming graph-based relational data to improve SRL algorithms. In particular, appropriate transformations of the nodes, links, and/or features of the data can dramatically affect the capabilities and results of SRL algorithms. We introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. More specifically, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed.
A Biomimetic Approach Based on Immune Systems for Classification of Unstructured Data
Hamou, Mohamed, Amine, Abdelmalek, Lokbani, Ahmed Chaouki
In this paper we present results on clustering unstructured data, in this case textual data from the Reuters-21578 corpus, with a new biomimetic approach based on immune systems. Before running the immune system, we converted the textual data to numeric form using the n-gram approach. The novelty lies in the hybridization of n-grams and immune systems for clustering. The experimental results show that the proposed ideas are promising and that this method can address the text clustering problem.
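The n-gram step mentioned in this abstract can be illustrated with a minimal sketch. It assumes character n-grams and raw counts; the abstract does not say which n-gram variant or weighting the authors used. The function names `char_ngrams` and `ngram_vector` and the two sample strings are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_vector(text, vocabulary, n=3):
    """Map a text to a count vector over a fixed, ordered n-gram vocabulary."""
    counts = char_ngrams(text, n)
    return [counts[gram] for gram in vocabulary]


# Illustrative documents (not taken from the Reuters corpus).
docs = ["oil prices rise", "oil prices fall"]

# Shared vocabulary: every 3-gram seen in any document, in sorted order.
vocabulary = sorted(set().union(*(char_ngrams(d) for d in docs)))

# One numeric vector per document, ready for a clustering algorithm.
vectors = [ngram_vector(d, vocabulary) for d in docs]
```

Each document becomes a fixed-length numeric vector, which is the form a clustering method needs as input.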
Learning Generative Models of Similarity Matrices
Rosales, Romer, Frey, Brendan J.
We describe a probabilistic (generative) view of affinity matrices along with inference algorithms for a subclass of problems associated with data clustering. This probabilistic view is helpful in understanding different models and algorithms that are based on affinity functions of the data. In particular, we show how (greedy) inference for a specific probabilistic model is equivalent to the spectral clustering algorithm. It also provides a framework for developing new algorithms and extended models. As one case, we present new generative data clustering models that allow us to infer the underlying distance measure suitable for the clustering problem at hand. These models seem to perform well in a larger class of problems for which other clustering algorithms (including spectral clustering) usually fail. Experimental evaluation was performed on a variety of point data sets, showing excellent performance.
Markov Random Walk Representations with Continuous Distributions
Yeang, Chen-Hsiang, Szummer, Martin
Representations based on random walks can exploit discrete data distributions for clustering and classification. We extend such representations from discrete to continuous distributions. Transition probabilities are now calculated using a diffusion equation with a diffusion coefficient that inversely depends on the data density. We relate this diffusion equation to a path integral and derive the corresponding path probability measure. The framework is useful for incorporating continuous data densities and prior knowledge.
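For context, the discrete starting point that this abstract extends can be sketched as follows. This is only an illustration of a discrete random-walk representation, not the continuous diffusion formulation the paper derives. The Gaussian affinity, the bandwidth `sigma`, the step count `t`, and all function names are assumptions made for the example.

```python
import math


def transition_matrix(points, sigma=1.0):
    """Row-normalize Gaussian affinities into one-step transition probabilities."""
    rows = []
    for x in points:
        weights = [math.exp(-((x - y) ** 2) / (2 * sigma ** 2)) for y in points]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def t_step_representation(points, t=3, sigma=1.0):
    """Row i holds the distribution over end points of a t-step walk from point i."""
    P = transition_matrix(points, sigma)
    Pt = P
    for _ in range(t - 1):
        Pt = matmul(Pt, P)
    return Pt


# Two well-separated groups on a line (illustrative data).
points = [0.0, 0.1, 0.2, 5.0, 5.1]
rep = t_step_representation(points, t=3)
```

Points in the same dense group end up with similar rows, because a short walk rarely crosses the gap between groups. That property is what makes such rows usable as features for clustering or classification.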
Fast Graph Construction Using Auction Algorithm
In practical machine learning systems, graph-based data representation has been widely used in various learning paradigms, ranging from unsupervised clustering to supervised classification. Besides applications with natural graph or network structure, such as social network analysis and relational learning, many other applications involve a critical step of converting data vectors to an adjacency graph. In particular, a sparse subgraph extracted from the original graph is often required for both theoretical and practical reasons. Previous studies clearly show that the performance of different learning algorithms, e.g., clustering and classification, benefits from such sparse subgraphs with balanced node connectivity. However, the existing graph construction methods are either computationally expensive or perform unsatisfactorily. In this paper, we utilize a scalable method called the auction algorithm and its parallel extension to recover a sparse yet nearly balanced subgraph with significantly reduced computational cost. An empirical study and comparison with state-of-the-art approaches clearly demonstrate the superiority of the proposed method in both efficiency and accuracy.
Unsupervised Joint Alignment and Clustering using Bayesian Nonparametrics
Mattar, Marwan A., Hanson, Allen R., Learned-Miller, Erik G.
Joint alignment of a collection of functions is the process of independently transforming the functions so that they appear more similar to each other. Typically, such unsupervised alignment algorithms fail when presented with complex data sets arising from multiple modalities or make restrictive assumptions about the form of the functions or transformations, limiting their generality. We present a transformed Bayesian infinite mixture model that can simultaneously align and cluster a data set. Our model and associated learning scheme offer two key advantages: the optimal number of clusters is determined in a data-driven fashion through the use of a Dirichlet process prior, and it can accommodate any transformation function parameterized by a continuous parameter vector. As a result, it is applicable to a wide range of data types and transformation functions. We present positive results on synthetic two-dimensional data, on a set of one-dimensional curves, and on various image data sets, showing large improvements over previous work. We discuss several variations of the model and conclude with directions for future work.
A Model-Based Approach to Rounding in Spectral Clustering
Poon, Leonard K. M., Liu, April H., Liu, Tengfei, Zhang, Nevin Lianwen
In spectral clustering, one defines a similarity matrix for a collection of data points, transforms the matrix to get the Laplacian matrix, finds the eigenvectors of the Laplacian matrix, and obtains a partition of the data using the leading eigenvectors. The last step is sometimes referred to as rounding, where one needs to decide how many leading eigenvectors to use, to determine the number of clusters, and to partition the data points. In this paper, we propose a novel method for rounding. The method differs from previous methods in three ways. First, we relax the assumption that the number of clusters equals the number of eigenvectors used. Second, when deciding the number of leading eigenvectors to use, we rely not only on information contained in the leading eigenvectors themselves, but also on subsequent eigenvectors. Third, our method is model-based and solves all three subproblems of rounding using a class of graphical models called latent tree models. We evaluate our method on both synthetic and real-world data. The results show that our method works correctly in the ideal case where between-cluster similarity is 0, and degrades gracefully as one moves away from the ideal case.
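The generic pipeline described at the start of this abstract (similarity matrix, Laplacian, eigenvectors, rounding) can be sketched for the simplest two-cluster case. This is not the latent tree model method the paper proposes: rounding here is a plain sign threshold on a single eigenvector. The eigenvector is found by power iteration to keep the sketch free of third-party libraries. The function name `fiedler_partition`, the iteration count, and the toy similarity values are assumptions.

```python
import math


def fiedler_partition(W, iters=500):
    """Two-way spectral partition of a symmetric similarity matrix W."""
    n = len(W)
    deg = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # M = D^{-1/2} W D^{-1/2}; the normalized Laplacian is I - M.
    M = [[inv_sqrt[i] * W[i][j] * inv_sqrt[j] for j in range(n)]
         for i in range(n)]
    # Trivial Laplacian eigenvector (eigenvalue 0): D^{1/2} 1, normalized.
    norm = math.sqrt(sum(deg))
    trivial = [math.sqrt(d) / norm for d in deg]

    def deflate_and_normalize(x):
        # Remove the trivial component, then rescale to unit length.
        dot = sum(a * b for a, b in zip(x, trivial))
        x = [a - dot * b for a, b in zip(x, trivial)]
        length = math.sqrt(sum(a * a for a in x))
        return [a / length for a in x]

    # Fixed, non-constant start vector so the result is deterministic.
    x = deflate_and_normalize([float(i) for i in range(n)])
    for _ in range(iters):
        # Power iteration on I + M, whose eigenvalues are all non-negative,
        # converges to the eigenvector of the second-smallest Laplacian eigenvalue.
        y = [x[i] + sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = deflate_and_normalize(y)
    # Rounding step: split the points by the sign of the eigenvector entries.
    return [1 if v > 0 else 0 for v in x]


# Toy similarities: two groups of three points, strong within-group
# similarity (1.0) and weak between-group similarity (0.01).
W = [[0.0 if i == j else (1.0 if (i < 3) == (j < 3) else 0.01)
      for j in range(6)] for i in range(6)]
labels = fiedler_partition(W)
```

The sign threshold fixes both the number of eigenvectors (one) and the number of clusters (two) in advance. These are exactly the decisions that the rounding method described in the abstract aims to make from the data instead.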