AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Dimensionality reduction methods for molecular simulations

Doerr, Stefan, Ariz-Extreme, Igor, Harvey, Matthew J., De Fabritiis, Gianni

arXiv.org Machine LearningNov-2-2017

Molecular simulations produce very high-dimensional data-sets with millions of data points. As analysis methods are often unable to cope with so many dimensions, it is common to use dimensionality reduction and clustering methods to reach a reduced representation of the data. Yet these methods often fail to capture the most important features necessary for the construction of a Markov model. Here we demonstrate the results of various dimensionality reduction methods on two simulation data-sets, one of protein folding and another of protein-ligand binding. The methods tested include a k-means clustering variant, a non-linear auto encoder, principal component analysis and tICA. The dimension-reduced data is then used to estimate the implied timescales of the slowest process by a Markov state model analysis to assess the quality of the projection. The projected dimensions learned from the data are visualized to demonstrate which conformations the various methods choose to represent the molecular process.

artificial intelligence, auto encoder, machine learning, (16 more...)

arXiv.org Machine Learning

1710.10629

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Inhomogeneous Hypergraph Clustering with Applications

Li, Pan, Milenkovic, Olgica

arXiv.org Machine LearningNov-2-2017

Hypergraph partitioning is an important problem in machine learning, computer vision and network analytics. A widely used method for hypergraph partitioning relies on minimizing a normalized sum of the costs of partitioning hyperedges across clusters. Algorithmic solutions based on this approach assume that different partitions of a hyperedge incur the same cost. However, this assumption fails to leverage the fact that different subsets of vertices within the same hyperedge may have different structural importance. We hence propose a new hypergraph clustering technique, termed inhomogeneous hypergraph partitioning, which assigns different costs to different hyperedge cuts. We prove that inhomogeneous partitioning produces a quadratic approximation to the optimal solution if the inhomogeneous costs satisfy submodularity constraints. Moreover, we demonstrate that inhomogenous partitioning offers significant performance improvements in applications such as structure learning of rankings, subspace segmentation and motif clustering.

artificial intelligence, hypergraph, machine learning, (16 more...)

arXiv.org Machine Learning

1709.01249

Country: North America > United States (0.28)

Genre: Research Report (0.81)

Industry: Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.89)

Add feedback

Introduction to K-means Clustering

#artificialintelligenceNov-1-2017, 19:15:12 GMT

The Κ-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the Κ centroids, which can either be randomly generated or randomly selected from the dataset. Each centroid defines one of the clusters.

artificial intelligence, centroid, machine learning, (4 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (0.85)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Optimizing K-Means Clustering for Time Series Data - DZone AI

@machinelearnbotNov-1-2017, 10:40:11 GMT

Here at New Relic, we collect 1.37 billion data points per minute. A vast amount of the data we collect, analyze, and display for our customers is stored as time series. In an effort to build relationships between applications and other entities, such as servers and containers, for new, intelligent products like New Relic Radar, we're constantly exploring faster and more efficient methods of grouping time series data. Given the amount of data we collect, faster clustering times are crucial. A popular method of grouping data is k-means clustering.

artificial intelligence, centroid, machine learning, (13 more...)

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.74)

Add feedback

Generating Time-Based Label Refinements to Discover More Precise Process Models

Tax, Niek, Alasgarov, Emin, Sidorova, Natalia, van der Aalst, Wil M. P., Haakma, Reinder

arXiv.org Artificial IntelligenceOct-31-2017

Process mining is a research field focused on the analysis of event data with the aim of extracting insights related to dynamic behavior. Applying process mining techniques on data from smart home environments has the potential to provide valuable insights into (un)healthy habits and to contribute to ambient assisted living solutions. Finding the right event labels to enable the application of process mining techniques is however far from trivial, as simply using the triggering sensor as the label for sensor events results in uninformative models that allow for too much behavior (i.e., the models are overgeneralizing). Refinements of sensor level event labels suggested by domain experts have been shown to enable discovery of more precise and insightful process models. However, there exists no automated approach to generate refinements of event labels in the context of process mining. In this paper we propose a framework for the automated generation of label refinements based on the time attribute of events, allowing us to distinguish behaviourally different instances of the same event type based on their time attribute. We show on a case study with real-life smart home event data that using automatically generated refined labels in process discovery, we can find more specific, and therefore more insightful, process models. We observe that one label refinement could have an effect on the usefulness of other label refinements when used together. Therefore, we explore four strategies to generate useful combinations of multiple label refinements and evaluate those on three real-life smart home event logs.

data mining, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

1705.09359

Country: Europe > Austria (0.28)

Genre:

Research Report (0.64)
Workflow (0.46)

Industry:

Health & Medicine (1.00)
Information Technology > Smart Houses & Appliances (0.76)
Materials > Metals & Mining (0.54)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Clustering Mixed Datasets Using Homogeneity Analysis with Applications to Big Data

Sambasivan, Rajiv, Das, Sourish

arXiv.org Machine LearningOct-30-2017

Datasets with a mixture of categorical and numerical attributes are pervasive in applications from business and socioeconomic settings. Clustering these datasets is an important activity in their analysis. Techniques to cluster these datasets have been developed by researchers, see for example [1], [2] and [3]. Techniques to cluster mixed datasets either prescribe a probabilistic generative model [4] or use a dissimilarity measure [5] to compute a dissimilarity matrix that is then clustered. Each of these approaches have issues that need to be addressed when they are applied to big datasets - datasets with a large number of instances compared to attributes.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1608.04961

Country: Europe > Austria (0.28)

Genre: Research Report (1.00)

Industry:

Consumer Products & Services > Travel (0.68)
Transportation > Passenger (0.47)
Transportation > Air (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.97)

Add feedback

Density Based Spatial Clustering of Applications with Noise (DBSCAN)

#artificialintelligenceOct-27-2017, 15:20:19 GMT

DBSCAN is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses more on the proximity and density of observations to form clusters. This is very different from KMeans, where an observation becomes a part of cluster represented by nearest centroid. DBSCAN clustering can identify outliers, observations which won't belong to any cluster. Since DBSCAN clustering identifies the number of clusters as well, it is very useful with unsupervised learning of the data when we don't know how many clusters could be there in the data.

artificial intelligence, dbscan, machine learning, (7 more...)

#artificialintelligence

Country: Asia > India (0.07)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Partitioning Relational Matrices of Similarities or Dissimilarities using the Value of Information

Sledge, Isaac J., Principe, Jose C.

arXiv.org Machine LearningOct-27-2017

In this paper, we provide an approach to clustering relational matrices whose entries correspond to either similarities or dissimilarities between objects. Our approach is based on the value of information, a parameterized, information-theoretic criterion that measures the change in costs associated with changes in information. Optimizing the value of information yields a deterministic annealing style of clustering with many benefits. For instance, investigators avoid needing to a priori specify the number of clusters, as the partitions naturally undergo phase changes, during the annealing process, whereby the number of clusters changes in a data-driven fashion. The global-best partition can also often be identified.

information, matrix, partition, (16 more...)

arXiv.org Machine Learning

1710.10381

Country:

Asia > Russia (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Add feedback

From social media to public health surveillance: Word embedding based clustering method for twitter classification

@machinelearnbotOct-26-2017, 19:50:07 GMT

Social media provide a low-cost alternative source for public health surveillance and health-related classification plays an important role to identify useful information. We summarized the recent classification methods using social media in public health. These methods rely on bag-of-words (BOW) model and have difficulty grasping the semantic meaning of texts. Unlike these methods, we present a word embedding based clustering method. Word embedding is one of the strongest trends in Natural Language Processing (NLP) at this moment. It learns the optimal vectors from surrounding words and the vectors can represent the semantic information of words.

classification, machine learning, natural language, (9 more...)

@machinelearnbot

Industry:

Health & Medicine > Public Health (1.00)
Health & Medicine > Epidemiology (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.63)

Add feedback

Energy Clustering

França, Guilherme, Vogelstein, Joshua T.

arXiv.org Machine LearningOct-26-2017

Energy statistics was proposed by Sz\'{e}kely in the 80's inspired by the Newtonian gravitational potential from classical mechanics, and it provides a hypothesis test for equality of distributions. It was further generalized from Euclidean spaces to metric spaces of strong negative type, and more recently, a connection with reproducing kernel Hilbert spaces (RKHS) was established. Here we consider the clustering problem from an energy statistics theory perspective, providing a precise mathematical formulation yielding a quadratically constrained quadratic program (QCQP) in the associated RKHS, thus establishing the connection with kernel methods. We show that this QCQP is equivalent to kernel $k$-means optimization problem once the kernel is fixed. These results imply a first principles derivation of kernel $k$-means from energy statistics. However, energy statistics fixes a family of standard kernels. Furthermore, we also consider a weighted version of energy statistics, making connection to graph partitioning problems. To find local optimizers of such QCQP we propose an iterative algorithm based on Hartigan's method, which in this case has the same computational cost as kernel $k$-means algorithm, based on Lloyd's heuristic, but usually with better clustering quality. We provide carefully designed numerical experiments showing the superiority of the proposed method compared to kernel $k$-means, spectral clustering, standard $k$-means, and Gaussian mixture models in a variety of settings.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1710.09859

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback