 Clustering


Dimensionality reduction methods for molecular simulations

arXiv.org Machine Learning

Molecular simulations produce very high-dimensional data-sets with millions of data points. As analysis methods are often unable to cope with so many dimensions, it is common to use dimensionality reduction and clustering methods to reach a reduced representation of the data. Yet these methods often fail to capture the most important features necessary for the construction of a Markov model. Here we demonstrate the results of various dimensionality reduction methods on two simulation data-sets, one of protein folding and another of protein-ligand binding. The methods tested include a k-means clustering variant, a non-linear autoencoder, principal component analysis (PCA), and time-lagged independent component analysis (tICA). The dimension-reduced data is then used to estimate the implied timescales of the slowest process by a Markov state model analysis to assess the quality of the projection. The projected dimensions learned from the data are visualized to demonstrate which conformations the various methods choose to represent the molecular process.
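For readers unfamiliar with the quality measure mentioned in this abstract, the implied timescale of a Markov state model is conventionally computed from the eigenvalues of the transition matrix estimated at a chosen lag time. The expression below is the standard textbook definition, not a formula taken from the paper itself; the symbols $\tau$ (lag time) and $\lambda_i$ (the $i$-th eigenvalue of the transition matrix $T(\tau)$, excluding the stationary eigenvalue $\lambda_1 = 1$) are notation assumed here.

```latex
t_i(\tau) = -\frac{\tau}{\ln \lvert \lambda_i(\tau) \rvert}, \qquad i \ge 2
```

Under this definition, an eigenvalue closer to 1 gives a longer implied timescale, so a projection that preserves the slowest process should yield a larger $t_2$.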


Inhomogeneous Hypergraph Clustering with Applications

arXiv.org Machine Learning

Hypergraph partitioning is an important problem in machine learning, computer vision and network analytics. A widely used method for hypergraph partitioning relies on minimizing a normalized sum of the costs of partitioning hyperedges across clusters. Algorithmic solutions based on this approach assume that different partitions of a hyperedge incur the same cost. However, this assumption fails to leverage the fact that different subsets of vertices within the same hyperedge may have different structural importance. We hence propose a new hypergraph clustering technique, termed inhomogeneous hypergraph partitioning, which assigns different costs to different hyperedge cuts. We prove that inhomogeneous partitioning produces a quadratic approximation to the optimal solution if the inhomogeneous costs satisfy submodularity constraints. Moreover, we demonstrate that inhomogeneous partitioning offers significant performance improvements in applications such as structure learning of rankings, subspace segmentation and motif clustering.


Introduction to K-means Clustering

#artificialintelligence

The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the dataset. Each centroid defines one of the clusters.
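The iterative refinement described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard assignment and update steps (often called Lloyd's algorithm), which the excerpt itself does not spell out. The function names `kmeans` and `squared_distance`, the `max_iter` and `seed` parameters, and the choice to initialize centroids by sampling points from the data set are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into `k` groups.

    Returns (centroids, assignments), where assignments[i] is the index
    of the centroid nearest to points[i].
    """
    rng = random.Random(seed)
    # Initial estimates: k distinct points randomly selected from the data set.
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )

    return centroids, assignments


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, labels = kmeans(data, k=2)
    print(centroids, labels)
```

The result depends on the initial centroids in general, which is why practical implementations usually run several random restarts and keep the partition with the lowest within-cluster sum of squares.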


Optimizing K-Means Clustering for Time Series Data - DZone AI

@machinelearnbot

Here at New Relic, we collect 1.37 billion data points per minute. A vast amount of the data we collect, analyze, and display for our customers is stored as time series. In an effort to build relationships between applications and other entities, such as servers and containers, for new, intelligent products like New Relic Radar, we're constantly exploring faster and more efficient methods of grouping time series data. Given the amount of data we collect, faster clustering times are crucial. A popular method of grouping data is k-means clustering.


Generating Time-Based Label Refinements to Discover More Precise Process Models

arXiv.org Artificial Intelligence

Process mining is a research field focused on the analysis of event data with the aim of extracting insights related to dynamic behavior. Applying process mining techniques on data from smart home environments has the potential to provide valuable insights into (un)healthy habits and to contribute to ambient assisted living solutions. Finding the right event labels to enable the application of process mining techniques is however far from trivial, as simply using the triggering sensor as the label for sensor events results in uninformative models that allow for too much behavior (i.e., the models are overgeneralizing). Refinements of sensor level event labels suggested by domain experts have been shown to enable discovery of more precise and insightful process models. However, there exists no automated approach to generate refinements of event labels in the context of process mining. In this paper we propose a framework for the automated generation of label refinements based on the time attribute of events, allowing us to distinguish behaviourally different instances of the same event type based on their time attribute. We show on a case study with real-life smart home event data that using automatically generated refined labels in process discovery, we can find more specific, and therefore more insightful, process models. We observe that one label refinement could have an effect on the usefulness of other label refinements when used together. Therefore, we explore four strategies to generate useful combinations of multiple label refinements and evaluate those on three real-life smart home event logs.


Clustering Mixed Datasets Using Homogeneity Analysis with Applications to Big Data

arXiv.org Machine Learning

Datasets with a mixture of categorical and numerical attributes are pervasive in applications from business and socioeconomic settings. Clustering these datasets is an important activity in their analysis. Techniques to cluster these datasets have been developed by researchers; see, for example, [1], [2] and [3]. Techniques to cluster mixed datasets either prescribe a probabilistic generative model [4] or use a dissimilarity measure [5] to compute a dissimilarity matrix that is then clustered. Each of these approaches has issues that need to be addressed when it is applied to big datasets, that is, datasets with a large number of instances compared to attributes.


Density Based Spatial Clustering of Applications with Noise (DBSCAN)

#artificialintelligence

DBSCAN is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses on the proximity and density of observations to form clusters. This is very different from KMeans, where an observation becomes part of the cluster represented by the nearest centroid. DBSCAN clustering can identify outliers, that is, observations which do not belong to any cluster. Since DBSCAN also determines the number of clusters, it is very useful for unsupervised learning when we don't know how many clusters there might be in the data.
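To make the density idea concrete, here is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is an illustration under stated assumptions, not the implementation used in the linked article: the parameter names `eps` (neighbourhood radius) and `min_samples` (minimum neighbours, counting the point itself, for a point to be a core point), the helper `region_query`, and the `NOISE = -1` label are choices made for this example. The neighbour search is a brute-force scan, so it only suits small data sets.

```python
import math

NOISE = -1  # label given to outliers that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue

        # points[i] is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point

    return labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (50, 50)]
    print(dbscan(data, eps=1.5, min_samples=3))
    # prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in: two dense groups are found from the data, and the isolated point at (50, 50) is reported as noise.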


Partitioning Relational Matrices of Similarities or Dissimilarities using the Value of Information

arXiv.org Machine Learning

In this paper, we provide an approach to clustering relational matrices whose entries correspond to either similarities or dissimilarities between objects. Our approach is based on the value of information, a parameterized, information-theoretic criterion that measures the change in costs associated with changes in information. Optimizing the value of information yields a deterministic annealing style of clustering with many benefits. For instance, investigators need not specify the number of clusters a priori, as the partitions naturally undergo phase changes during the annealing process, whereby the number of clusters changes in a data-driven fashion. The global-best partition can also often be identified.


From social media to public health surveillance: Word embedding based clustering method for twitter classification

@machinelearnbot

Social media provide a low-cost alternative source for public health surveillance, and health-related classification plays an important role in identifying useful information. We summarized the recent classification methods using social media in public health. These methods rely on the bag-of-words (BOW) model and have difficulty grasping the semantic meaning of texts. Unlike these methods, we present a word embedding based clustering method. Word embedding is one of the strongest trends in Natural Language Processing (NLP) at this moment. It learns the optimal vectors from surrounding words, and the vectors can represent the semantic information of words.


Energy Clustering

arXiv.org Machine Learning

Energy statistics was proposed by Székely in the 1980s, inspired by the Newtonian gravitational potential from classical mechanics, and it provides a hypothesis test for equality of distributions. It was further generalized from Euclidean spaces to metric spaces of strong negative type, and more recently, a connection with reproducing kernel Hilbert spaces (RKHS) was established. Here we consider the clustering problem from an energy statistics theory perspective, providing a precise mathematical formulation yielding a quadratically constrained quadratic program (QCQP) in the associated RKHS, thus establishing the connection with kernel methods. We show that this QCQP is equivalent to the kernel $k$-means optimization problem once the kernel is fixed. These results imply a first principles derivation of kernel $k$-means from energy statistics. However, energy statistics fixes a family of standard kernels. Furthermore, we also consider a weighted version of energy statistics, making connection to graph partitioning problems. To find local optimizers of such a QCQP we propose an iterative algorithm based on Hartigan's method, which in this case has the same computational cost as the kernel $k$-means algorithm, based on Lloyd's heuristic, but usually with better clustering quality. We provide carefully designed numerical experiments showing the superiority of the proposed method compared to kernel $k$-means, spectral clustering, standard $k$-means, and Gaussian mixture models in a variety of settings.