Clustering
Minimum Spectral Connectivity Projection Pursuit
Hofmeyr, David P., Pavlidis, Nicos G., Eckley, Idris A.
We study the problem of determining the optimal low dimensional projection for maximising the separability of a binary partition of an unlabelled dataset, as measured by spectral graph theory. This is achieved by finding projections which minimise the second eigenvalue of the graph Laplacian of the projected data, which corresponds to a non-convex, non-smooth optimisation problem. We show that the optimal univariate projection based on spectral connectivity converges to the vector normal to the maximum margin hyperplane through the data, as the scaling parameter is reduced to zero. This establishes a connection between connectivity as measured by spectral graph theory and maximal Euclidean separation. The computational cost associated with each eigen-problem is quadratic in the number of data. To mitigate this issue, we propose an approximation method using microclusters with provable approximation error bounds. Combining multiple binary partitions within a divisive hierarchical model allows us to construct clustering solutions admitting clusters with varying scales and lying within different subspaces. We evaluate the performance of the proposed method on a large collection of benchmark datasets and find that it compares favourably with existing methods for projection pursuit and dimension reduction for data clustering.
A More Effective Approach to Unsupervised Learning with Time Series Data
Come see Anshuman Guha, Data Scientist from Spark Cognition Speak at ODSC West. In machine learning, the most traditional and popular methods of clustering are hierarchical clustering (similarity-based clustering) and k-means clustering (feature-based clustering). Hierarchical clustering, put simply, is grouping together points in a vector space that are closest in distance from each other. Hierarchical clustering works great on small datasets. A major advantage of this method is the user does not need to know anything about the dataset in advance and specify any hyper-parameters (like number of clusters).
Unified Spectral Clustering with Optimal Graph
Kang, Zhao, Peng, Chong, Cheng, Qiang, Xu, Zenglin
Spectral clustering has found extensive use in many areas. Most traditional spectral clustering algorithms work in three separate steps: similarity graph construction; continuous labels learning; discretizing the learned labels by k-means clustering. Such common practice has two potential flaws, which may lead to severe information loss and performance degradation. First, predefined similarity graph might not be optimal for subsequent clustering. It is well-accepted that similarity graph highly affects the clustering results. To this end, we propose to automatically learn similarity information from data and simultaneously consider the constraint that the similarity matrix has exact c connected components if there are c clusters. Second, the discrete solution may deviate from the spectral solution since k-means method is well-known as sensitive to the initialization of cluster centers. In this work, we transform the candidate solution into a new one that better approximates the discrete one. Finally, those three subtasks are integrated into a unified framework, with each subtask iteratively boosted by using the results of the others towards an overall optimal solution. It is known that the performance of a kernel method is largely determined by the choice of kernels. To tackle this practical problem of how to select the most suitable kernel for a particular data set, we further extend our model to incorporate multiple kernel learning ability. Extensive experiments demonstrate the superiority of our proposed method as compared to existing clustering approaches.
CUR Decompositions, Similarity Matrices, and Subspace Clustering
Aldroubi, Akram, Hamm, Keaton, Koku, Ahmet Bugra, Sekmen, Ali
A general framework for solving the subspace clustering problem using the CUR decomposition is presented. The CUR decomposition provides a natural way to construct similarity matrices for data that come from a union of unknown subspaces $\mathscr{U}=\underset{i=1}{\overset{M}\bigcup}S_i$. The similarity matrices thus constructed give the exact clustering in the noise-free case. A simple adaptation of the technique also allows clustering of noisy data. Two known methods for subspace clustering can be derived from the CUR technique. Experiments on synthetic and real data are presented to test the method.
Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and $k$-means
Forster, Dennis, Lรผcke, Jรถrg
One iteration of $k$-means or EM for Gaussian mixture models (GMMs) scales linearly with the number of data points $N$, the number of clusters $C$, and the data dimensionality $D$. In this study, we explore whether one iteration of $k$-means or EM for GMMs can scale sublinearly with $C$ at run-time, while the increase of the clustering objective remains effective. The tool we apply for complexity reduction is variational EM, which is typically applied to make training of generative models with exponentially many hidden states tractable. Here, we apply novel theoretical results on truncated variational EM to make tractable clustering algorithms more efficient. The basic idea is the use of a partial variational E-step which reduces the linear complexity of $\mathcal{O}(NCD)$ required for a full E-step to a sublinear complexity. Our main observation is that the linear dependency on $C$ can be reduced to a dependency on a much smaller parameter $G$, related to the cluster neighborhood relationship. We focus on two versions of partial variational EM for clustering: variational GMM, scaling with $\mathcal{O}(NG^2D)$, and variational $k$-means, scaling with $\mathcal{O}(NGD)$ per iteration. Empirical results then show that these algorithms still require comparable numbers of iterations to increase the clustering objective to the same values as $k$-means. For data with many clusters, we consequently observe reductions of the net computational demands between two and three orders of magnitude. More generally, our results provide substantial empirical evidence in favor of clustering to scale sublinearly with $C$.
Multi-Relevance Transfer Learning
Transfer learning aims to faciliate learning tasks in a label-scarce target domain by leveraging knowledge from a related source domain with plenty of labeled data. Often times we may have multiple domains with little or no labeled data as targets waiting to be solved. Most existing efforts tackle target domains separately by modeling the `source-target' pairs without exploring the relatedness between them, which would cause loss of crucial information, thus failing to achieve optimal capability of knowledge transfer. In this paper, we propose a novel and effective approach called Multi-Relevance Transfer Learning (MRTL) for this purpose, which can simultaneously transfer different knowledge from the source and exploits the shared common latent factors between target domains. Specifically, we formulate the problem as an optimization task based on a collective nonnegative matrix tri-factorization framework. The proposed approach achieves both source-target transfer and target-target leveraging by sharing multiple decomposed latent subspaces. Further, an alternative minimization learning algorithm is developed with convergence guarantee. Empirical study validates the performance and effectiveness of MRTL compared to the state-of-the-art methods.
Parallel K-Means Clustering With Reducer Function - DZone AI
In functional programming, a fold is a higher-order function, also known as reduce and accumulate, whose purpose is to reduce a given data structure, typically a sequence of elements, into a single value. For example, a reduction could return an average value for a series of numbers, calculating a summation, maximum value, or minimum value. The fold function takes an initial value, commonly called the accumulator, which is used as a container for intermediate results. Then, the second argument it takes is a binary expression that acts as a reduction function to apply against each element in the sequence to finally return the new value for the accumulator. In general, reduction works as follows.
K-means, SOM, k-nn or classical clustering methods?
The best-known optimization clustering algorithm is k-means clustering. Unlike hierarchical clustering methods that require processing time proportional to the square or cube of the number of observations, the time required by the k-means algorithm is proportional to the number of observations. This means that k-means clustering can be used on larger data sets. In fact, k-means clustering is inappropriate for small ( 100 observations) data sets. If the data set is small, the k-means solution becomes sensitive to the order in which the observations appear (the order effect).
A Spectral Algorithm with Additive Clustering for the Recovery of Overlapping Communities in Networks
Kaufmann, Emilie, Bonald, Thomas, Lelarge, Marc
The commonly accepted definition of a community is that nodes tend to be more densely connected within a community than with the rest of the graph. Communities are often hidden in practice and recovering the community structure directly from the graph is a key step in the analysis of these datasets. Spectral algorithms are popular methods for detecting communities [26], that consist in two phases. First, a spectral embedding is built, where then nodes of the graph are projected onto some low dimensional space generated by well-chosen eigenvectors of some matrix related to the graph (e.g., the adjacency matrix or a Laplacian matrix). Then, a clustering algorithm (e.g.,k -means ork -median) is applied to then embedded vectors to obtain a partition of the nodes into communities.
Distribution-Preserving k-Anonymity
Wei, Dennis, Ramamurthy, Karthikeyan Natesan, Varshney, Kush R.
Preserving the privacy of individuals by protecting their sensitive attributes is an important consideration during microdata release. However, it is equally important to preserve the quality or utility of the data for at least some targeted workloads. We propose a novel framework for privacy preservation based on the k-anonymity model that is ideally suited for workloads that require preserving the probability distribution of the quasi-identifier variables in the data. Our framework combines the principles of distribution-preserving quantization and k-member clustering, and we specialize it to two variants that respectively use intra-cluster and Gaussian dithering of cluster centers to achieve distribution preservation. We perform theoretical analysis of the proposed schemes in terms of distribution preservation, and describe their utility in workloads such as covariate shift and transfer learning where such a property is necessary. Using extensive experiments on real-world Medical Expenditure Panel Survey data, we demonstrate the merits of our algorithms over standard k-anonymization for a hallmark health care application where an insurance company wishes to understand the risk in entering a new market. Furthermore, by empirically quantifying the reidentification risk, we also show that the proposed approaches indeed maintain k-anonymity.