AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Explaining K-Means Clustering

#artificialintelligenceJul-23-2020, 01:36:42 GMT

Today's data comes in all shapes and sizes. NLP data encompasses the written word, time-series data tracks sequential data movement over time (ie. Whichever dataset you possess, you can be sure there is an algorithm ready to decipher its secrets. In this article, we want to cover a clustering algorithm named KMeans which attempts to uncover hidden subgroups hiding in your dataset. Furthermore, we will examine what effects dimension reduction has on the quality of the clusters obtained from KMeans.

artificial intelligence, centroid, machine learning, (18 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Scalable Initialization Methods for Large-Scale Clustering

Hämäläinen, Joonas, Kärkkäinen, Tommi, Rossi, Tuomo

arXiv.org Machine LearningJul-23-2020

In this work, two new initialization methods for K-means clustering are proposed. Both proposals are based on applying a divide-and-conquer approach for the K-means|| type of an initialization strategy. The second proposal also utilizes multiple lower-dimensional subspaces produced by the random projection method for the initialization. The proposed methods are scalable and can be run in parallel, which make them suitable for initializing large-scale problems. In the experiments, comparison of the proposed methods to the K-means++ and K-means|| methods is conducted using an extensive set of reference and synthetic large-scale datasets. Concerning the latter, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state-of-the-art. We also observe that the currently most popular K-means++ initialization behaves like the random one in the very high-dimensional cases.

artificial intelligence, dataset, machine learning, (18 more...)

arXiv.org Machine Learning

2007.11937

Country:

Europe > Finland > Central Finland > Jyväskylä (0.04)
North America > United States > Texas (0.04)
North America > United States > Massachusetts > Plymouth County > Norwell (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Unsupervised Heterogeneous Coupling Learning for Categorical Representation

Zhu, Chengzhang, Cao, Longbing, Yin, Jianping

arXiv.org Machine LearningJul-21-2020

Complex categorical data is often hierarchically coupled with heterogeneous relationships between attributes and attribute values and the couplings between objects. Such value-to-object couplings are heterogeneous with complementary and inconsistent interactions and distributions. Limited research exists on unlabeled categorical data representations, ignores the heterogeneous and hierarchical couplings, underestimates data characteristics and complexities, and overuses redundant information, etc. The deep representation learning of unlabeled categorical data is challenging, overseeing such value-to-object couplings, complementarity and inconsistency, and requiring large data, disentanglement, and high computational power. This work introduces a shallow but powerful UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings and revealing heterogeneous distributions embedded in each type of couplings. UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings. Theoretical analysis shows that UNTIE can represent categorical data with maximal separability while effectively represent heterogeneous couplings and disclose their roles in categorical data. The UNTIE-learned representations make significant performance improvement against the state-of-the-art categorical representations and deep representation models on 25 categorical data sets with diversified characteristics.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

doi: 10.1109/TPAMI.2020.3010953

2007.1072

Country:

North America > United States > Wisconsin (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback

Spectral Clustering using Eigenspectrum Shape Based Nystrom Sampling

Bouneffouf, Djallel

arXiv.org Machine LearningJul-21-2020

Spectral clustering has shown a superior performance in analyzing the cluster structure. However, its computational complexity limits its application in analyzing large-scale data. To address this problem, many low-rank matrix approximating algorithms are proposed, including the Nystrom method - an approach with proven approximate error bounds. There are several algorithms that provide recipes to construct Nystrom approximations with variable accuracies and computing times. This paper proposes a scalable Nystrom-based clustering algorithm with a new sampling procedure, Centroid Minimum Sum of Squared Similarities (CMS3), and a heuristic on when to use it. Our heuristic depends on the eigen spectrum shape of the dataset, and yields competitive low-rank approximations in test datasets compared to the other state-of-the-art methods

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

2007.11416

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Malaysia > Sarawak > Kuching (0.04)
Asia > Macao (0.04)
(12 more...)

Genre:

Research Report (0.84)
Instructional Material (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Top 10 Machine Learning Algorithms for ML Beginners

#artificialintelligenceJul-20-2020, 08:00:16 GMT

In the last decade machine learning becomes one of the hottest topics in the world, Andrew Ng considers it as the new electricity. In today's world machine learning powers many of the services we use -- recommendation systems like those on Netflix, YouTube, and Spotify; search engines like Google and Baidu; social-media feeds like Facebook and Twitter; voice assistants like Siri and Alexa. Having known that, let's see how machine learning works. In simple terms, machine learning algorithms use statistics to find patterns in massive amounts of data. The data are also known as the dataset, it's could contain images, texts, words, and clicks.

algorithm, artificial intelligence, machine learning, (14 more...)

#artificialintelligence

Industry:

Leisure & Entertainment (0.71)
Media > Film (0.37)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.30)

Add feedback

$p$-Norm Flow Diffusion for Local Graph Clustering

Fountoulakis, Kimon, Wang, Di, Yang, Shenghao

arXiv.org Machine LearningJul-20-2020

Local graph clustering and the closely related seed set expansion problem are primitives on graphs that are central to a wide range of analytic and learning tasks such as local clustering, community detection, nodes ranking and feature inference. Prior work on local graph clustering mostly falls into two categories with numerical and combinatorial roots respectively. In this work, we draw inspiration from both fields and propose a family of convex optimization formulations based on the idea of diffusion with p-norm network flow for $p\in (1,\infty)$. In the context of local clustering, we characterize the optimal solutions for these optimization problems and show their usefulness in finding low conductance cuts around input seed set. In particular, we achieve quadratic approximation of conductance in the case of $p=2$ similar to the Cheeger-type bounds of spectral methods, constant factor approximation when $p\rightarrow\infty$ similar to max-flow based methods, and a smooth transition for general $p$ values in between. Thus, our optimization formulation can be viewed as bridging the numerical and combinatorial approaches, and we can achieve the best of both worlds in terms of speed and noise robustness. We show that the proposed problem can be solved in strongly local running time for $p\ge 2$ and conduct empirical evaluations on both synthetic and real-world graphs to illustrate our approach compares favorably with existing methods.

artificial intelligence, machine learning, optimization problem, (18 more...)

arXiv.org Machine Learning

2005.0981

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)

Add feedback

Latent Instrumental Variables as Priors in Causal Inference based on Independence of Cause and Mechanism

Sokolovska, Nataliya, Wuillemin, Pierre-Henri

arXiv.org Artificial IntelligenceJul-17-2020

Causal inference methods based on conditional independence construct Markov equivalent graphs, and cannot be applied to bivariate cases. The approaches based on independence of cause and mechanism state, on the contrary, that causal discovery can be inferred for two observations. In our contribution, we challenge to reconcile these two research directions. We study the role of latent variables such as latent instrumental variables and hidden common causes in the causal graphical structures. We show that the methods based on the independence of cause and mechanism, indirectly contain traces of the existence of the hidden instrumental variables. We derive a novel algorithm to infer causal relationships between two variables, and we validate the proposed method on simulated data and on a benchmark of cause-effect pairs. We illustrate by our experiments that the proposed approach is simple and extremely competitive in terms of empirical accuracy compared to the state-of-the-art methods.

artificial intelligence, instrumental variable, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2007.08812

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Europe > France > Île-de-France > Paris > Paris (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Selecting the Number of Clusters $K$ with a Stability Trade-off: an Internal Validation Criterion

Mourer, Alex, Forest, Florent, Lebbah, Mustapha, Azzag, Hanane, Lacaille, Jérôme

arXiv.org Machine LearningJul-16-2020

Model selection is a major challenge in non-parametric clustering. There is no universally admitted way to evaluate clustering results for the obvious reason that there is no ground truth against which results could be tested, as in supervised learning. The difficulty to find a universal evaluation criterion is a direct consequence of the fundamentally ill-defined objective of clustering. In this perspective, clustering stability has emerged as a natural and model-agnostic principle: an algorithm should find stable structures in the data. If data sets are repeatedly sampled from the same underlying distribution, an algorithm should find similar partitions. However, it turns out that stability alone is not a well-suited tool to determine the number of clusters. For instance, it is unable to detect if the number of clusters is too small. We propose a new principle for clustering validation: a good clustering should be stable, and within each cluster, there should exist no stable partition. This principle leads to a novel internal clustering validity criterion based on between-cluster and within-cluster stability, overcoming limitations of previous stability-based methods. We empirically show the superior ability of additive noise to discover structures, compared with sampling-based perturbation. We demonstrate the effectiveness of our method for selecting the number of clusters through a large number of experiments and compare it with existing evaluation methods.

artificial intelligence, machine learning, stability, (17 more...)

arXiv.org Machine Learning

2006.0853

Country: Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Data Stream Clustering: A Review

Zubaroğlu, Alaettin, Atalay, Volkan

arXiv.org Artificial IntelligenceJul-16-2020

Number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with less prior information about the data and it does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and it has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still open problems. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems about data stream clustering are also discussed.

data mining, machine learning, real time system, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/s10462-020-09874-x

2007.10781

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Oceania > New Zealand > North Island > Waikato (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)

Genre:

Overview (1.00)
Research Report (0.81)

Industry: Information Technology > Services (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Architecture > Real Time Systems (1.00)

Add feedback

Modified Possibilistic Fuzzy C-Means Algorithm for Clustering Incomplete Data Sets

Rustam, null, Usman, Koredianto, Kamaruddin, Mudyawati, Chamidah, Dina, Nopendri, null, Saleh, Khaerudin, Eliskar, Yulinda, Marzuki, Ismail

arXiv.org Artificial IntelligenceJul-15-2020

Possibilistic fuzzy c-means (PFCM) algorithm is a reliable algorithm has been proposed to deal the weakness of two popular algorithms for clustering, fuzzy c-means (FCM) and possibilistic c-means (PCM). PFCM algorithm deals with the weaknesses of FCM in handling noise sensitivity and the weaknesses of PCM in the case of coincidence clusters. However, the PFCM algorithm can be only applied to cluster complete data sets. Therefore, in this study, we propose a modification of the PFCM algorithm that can be applied to incomplete data sets clustering. We modified the PFCM algorithm to OCSPFCM and NPSPFCM algorithms and measured performance on three things: 1) accuracy percentage, 2) a number of iterations to termination, and 3) centroid errors. Based on the results that both algorithms have the potential for clustering incomplete data sets. However, the performance of the NPSPFCM algorithm is better than the OCSPFCM algorithm for clustering incomplete data sets.

algorithm, artificial intelligence, machine learning, (11 more...)

arXiv.org Artificial Intelligence

2007.04908

Country:

Asia > Indonesia > Java > East Java > Surabaya (0.04)
Europe > Italy > Liguria > Genoa (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Indonesia > Sulawesi > South Sulawesi > Makassar (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Data Science (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback