AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Community recovery in non-binary and temporal stochastic block models

Avrachenkov, Konstantin, Dreveton, Maximilien, Leskelä, Lasse

arXiv.org Artificial IntelligenceAug-30-2022

This article studies the estimation of latent community memberships from pairwise interactions in a network of $N$ nodes, where the observed interactions can be of arbitrary type, including binary, categorical, and vector-valued, and not excluding even more general objects such as time series or spatial point patterns. As a generative model for such data, we introduce a stochastic block model with a general measurable interaction space $\mathcal S$, for which we derive information-theoretic bounds for the minimum achievable error rate. These bounds yield sharp criteria for the existence of consistent and strongly consistent estimators in terms of data sparsity, statistical similarity between intra- and inter-block interaction distributions, and the shape and size of the interaction space. The general framework makes it possible to study temporal and multiplex networks with $\mathcal S = \{0,1\}^T$, in settings where both $N \to \infty$ and $T \to \infty$, and the temporal interaction patterns are correlated over time. For temporal Markov interactions, we derive sharp consistency thresholds. We also present fast online estimation algorithms which fully utilise the non-binary nature of the observed data. Numerical experiments on synthetic and real data show that these algorithms rapidly produce accurate estimates even for very sparse data arrays.

interaction, stochastic block model, theorem 3, (11 more...)

arXiv.org Artificial Intelligence

2008.0479

Country:

North America > United States (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (1.00)

Industry:

Education (0.67)
Health & Medicine (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback

A Dynamic Mode Decomposition Approach for Decentralized Spectral Clustering of Graphs

Zhu, Hongyu, Klus, Stefan, Sahai, Tuhin

arXiv.org Artificial IntelligenceAug-29-2022

We propose a novel robust decentralized graph clustering algorithm that is provably equivalent to the popular spectral clustering approach. Our proposed method uses the existing wave equation clustering algorithm that is based on propagating waves through the graph. However, instead of using a fast Fourier transform (FFT) computation at every node, our proposed approach exploits the Koopman operator framework. Specifically, we show that propagating waves in the graph followed by a local dynamic mode decomposition (DMD) computation at every node is capable of retrieving the eigenvalues and the local eigenvector components of the graph Laplacian, thereby providing local cluster assignments for all nodes. We demonstrate that the DMD computation is more robust than the existing FFT based approach and requires 20 times fewer steps of the wave equation to accurately recover the clustering information and reduces the relative error by orders of magnitude. We demonstrate the decentralized approach on a range of graph clustering problems.

eigenvalue, eigenvector, graph, (16 more...)

arXiv.org Artificial Intelligence

2203.00004

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Surrey (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

fMBN-E: Efficient Unsupervised Network Structure Ensemble and Selection for Clustering

Zhang, Xiao-Lei

arXiv.org Artificial IntelligenceAug-29-2022

It is known that unsupervised nonlinear dimensionality reduction and clustering is sensitive to the selection of hyperparameters, particularly for deep learning based methods, which hinders its practical use. How to select a proper network structure that may be dramatically different in different applications is a hard issue for deep models, given little prior knowledge of data. In this paper, we aim to automatically determine the optimal network structure of a deep model, named multilayer bootstrap networks (MBN), via simple ensemble learning and selection techniques. Specifically, we first propose an MBN ensemble (MBN-E) algorithm which concatenates the sparse outputs of a set of MBN base models with different network structures into a new representation. Then, we take the new representation produced by MBN-E as a reference for selecting the optimal MBN base models. Moreover, we propose a fast version of MBN-E (fMBN-E), which is not only theoretically even faster than a single standard MBN but also does not increase the estimation error of MBN-E. Importantly, MBN-E and its ensemble selection techniques maintain the simple formulation of MBN that is based on one-nearest-neighbor learning. Empirically, comparing to a number of advanced deep clustering methods and as many as 20 representative unsupervised ensemble learning and selection methods, the proposed methods reach the state-of-the-art performance without manual hyperparameter tuning. fMBN-E is empirically even hundreds of times faster than MBN-E without suffering performance degradation. The applications to image segmentation and graph data mining further demonstrate the advantage of the proposed methods.

base model, ensemble, mbn-e, (17 more...)

arXiv.org Artificial Intelligence

2107.02071

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre:

Research Report (0.64)
Overview (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Leachable Component Clustering

Cheng, Miao, You, Xinge

arXiv.org Artificial IntelligenceAug-28-2022

Clustering attempts to partition data instances into several distinctive groups, while the similarities among data belonging to the common partition can be principally reserved. Furthermore, incomplete data frequently occurs in many realworld applications, and brings perverse influence on pattern analysis. As a consequence, the specific solutions to data imputation and handling are developed to conduct the missing values of data, and independent stage of knowledge exploitation is absorbed for information understanding. In this work, a novel approach to clustering of incomplete data, termed leachable component clustering, is proposed. Rather than existing methods, the proposed method handles data imputation with Bayes alignment, and collects the lost patterns in theory. Due to the simple numeric computation of equations, the proposed method can learn optimized partitions while the calculation efficiency is held. Experiments on several artificial incomplete data sets demonstrate that, the proposed method is able to present superior performance compared with other state-of-the-art algorithms.

imputation, incomplete data, international conference, (17 more...)

arXiv.org Artificial Intelligence

2208.13217

Country:

Asia > China > Guangxi Province (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(13 more...)

Genre: Research Report > Promising Solution (0.35)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Incremental Semantic Localization using Hierarchical Clustering of Object Association Sets

Hu, Lan, Luo, Zhongwei, Yuan, Runze, Cao, Yuchen, Wei, Jiaxin, Kneip, Kai Wangand Laurent

arXiv.org Artificial IntelligenceAug-28-2022

We present a novel approach for relocalization or place recognition, a fundamental problem to be solved in many robotics, automation, and AR applications. Rather than relying on often unstable appearance information, we consider a situation in which the reference map is given in the form of localized objects. Our localization framework relies on 3D semantic object detections, which are then associated to objects in the map. Possible pair-wise association sets are grown based on hierarchical clustering using a merge metric that evaluates spatial compatibility. The latter notably uses information about relative object configurations, which is invariant with respect to global transformations. Association sets are furthermore updated and expanded as the camera incrementally explores the environment and detects further objects. We test our algorithm in several challenging situations including dynamic scenes, large view-point changes, and scenes with repeated instances. Our experiments demonstrate that our approach outperforms prior art in terms of both robustness and accuracy.

computer vision, proceedings, relocalization, (14 more...)

arXiv.org Artificial Intelligence

2208.1321

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

Add feedback

Outlier Detection Techniques in Python

#artificialintelligenceAug-26-2022, 16:14:08 GMT

Outlier detection, which is the process of identifying extreme values in data, has many applications across a wide variety of industries including finance, insurance, cybersecurity and healthcare. In finance, for example, it can detect malicious events like credit card fraud. In insurance, it can identify forged or fabricated documents. In cybersecurity, it is used for identifying malicious behaviors like password theft and phishing. Finally, outlier detection has been used for rare disease detection in a healthcare context.

iqr, outlier, outlier detection, (15 more...)

#artificialintelligence

Country: North America > United States (0.05)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)

Add feedback

Elastic Product Quantization for Time Series

Robberechts, Pieter, Meert, Wannes, Davis, Jesse

arXiv.org Artificial IntelligenceAug-26-2022

Analyzing numerous or long time series is difficult in practice due to the high storage costs and computational requirements. Therefore, techniques have been proposed to generate compact similarity-preserving representations of time series, enabling real-time similarity search on large in-memory data collections. However, the existing techniques are not ideally suited for assessing similarity when sequences are locally out of phase. In this paper, we propose the use of product quantization for efficient similarity-based comparison of time series under time warping. The idea is to first compress the data by partitioning the time series into equal length sub-sequences which are represented by a short code. The distance between two time series can then be efficiently approximated by pre-computed elastic distances between their codes. The partitioning into sub-sequences forces unwanted alignments, which we address with a pre-alignment step using the maximal overlap discrete wavelet transform (MODWT). To demonstrate the efficiency and accuracy of our method, we perform an extensive experimental evaluation on benchmark datasets in nearest neighbors classification and clustering applications. Overall, the proposed solution emerges as a highly efficient (both in terms of memory usage and computation time) replacement for elastic measures in time series applications.

pqdtw, subspace, time sery, (13 more...)

arXiv.org Artificial Intelligence

2201.01856

Country:

North America > United States > New York > New York County > New York City (0.05)
Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.05)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.94)
Information Technology > Data Science > Data Quality > Data Transformation (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Shallow decision trees for explainable $k$-means clustering

Laber, Eduardo, Murtinho, Lucas, Oliveira, Felipe

arXiv.org Artificial IntelligenceAug-26-2022

A number of recent works have employed decision trees for the construction of explainable partitions that aim to minimize the $k$-means cost function. These works, however, largely ignore metrics related to the depths of the leaves in the resulting tree, which is perhaps surprising considering how the explainability of a decision tree depends on these depths. To fill this gap in the literature, we propose an efficient algorithm that takes into account these metrics. In experiments on 16 datasets, our algorithm yields better results than decision-tree clustering algorithms such as the ones presented in \cite{dasgupta2020explainable}, \cite{frost2020exkmc}, \cite{laber2021price} and \cite{DBLP:conf/icml/MakarychevS21}, typically achieving lower or equivalent costs with considerably shallower trees. We also show, through a simple adaptation of existing techniques, that the problem of building explainable partitions induced by binary trees for the $k$-means cost function does not admit an $(1+\epsilon)$-approximation in polynomial time unless $P=NP$, which justifies the quest for approximation algorithms and/or heuristics.

algorithm, exshallow, partition, (17 more...)

arXiv.org Artificial Intelligence

2112.14718

Country:

South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Monterey County > Monterey (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Adaptively-weighted Integral Space for Fast Multiview Clustering

Chen, Man-Sheng, Liu, Tuo, Wang, Chang-Dong, Huang, Dong, Lai, Jian-Huang

arXiv.org Artificial IntelligenceAug-25-2022

Multiview clustering has been extensively studied to take advantage of multi-source information to improve the clustering performance. In general, most of the existing works typically compute an n * n affinity graph by some similarity/distance metrics (e.g. the Euclidean distance) or learned representations, and explore the pairwise correlations across views. But unfortunately, a quadratic or even cubic complexity is often needed, bringing about difficulty in clustering largescale datasets. Some efforts have been made recently to capture data distribution in multiple views by selecting view-wise anchor representations with k-means, or by direct matrix factorization on the original observations. Despite the significant success, few of them have considered the view-insufficiency issue, implicitly holding the assumption that each individual view is sufficient to recover the cluster structure. Moreover, the latent integral space as well as the shared cluster structure from multiple insufficient views is not able to be simultaneously discovered. In view of this, we propose an Adaptively-weighted Integral Space for Fast Multiview Clustering (AIMC) with nearly linear complexity. Specifically, view generation models are designed to reconstruct the view observations from the latent integral space with diverse adaptive contributions. Meanwhile, a centroid representation with orthogonality constraint and cluster partition are seamlessly constructed to approximate the latent integral space. An alternate minimizing algorithm is developed to solve the optimization problem, which is proved to have linear time complexity w.r.t. the sample size. Extensive experiments conducted on several realworld datasets confirm the superiority of the proposed AIMC method compared with the state-of-the-art methods.

dataset, integral space, representation, (16 more...)

arXiv.org Artificial Intelligence

2208.12808

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback

Four Algorithms for Correlation Clustering: A Survey

Jafarov, Jafar

arXiv.org Artificial IntelligenceAug-24-2022

In the Correlation Clustering problem, we are given a set of objects with pairwise similarity information. Our aim is to partition these objects into clusters that match this information as closely as possible. More specifically, the pairwise information is given as a weighted graph $G$ with its edges labelled as ``similar" or ``dissimilar" by a binary classifier. The goal is to produce a clustering that minimizes the weight of ``disagreements": the sum of the weights of similar edges across clusters and dissimilar edges within clusters. In this exposition we focus on the case when $G$ is complete and unweighted. We explore four approximation algorithms for the Correlation Clustering problem under this assumption. In particular, we describe the following algorithms: (i) the $17429-$approximation algorithm by Bansal, Blum, and Chawla, (ii) the $4-$approximation algorithm by Charikar, Guruswami, and Wirth (iii) the $3-$approximation algorithm by Ailon, Charikar, and Newman (iv) the $2.06-$approximation algorithm by Chawla, Makarychev, Schramm, and Yaroslavtsev.

algorithm, disagreement, triangle, (14 more...)

arXiv.org Artificial Intelligence

2208.12636

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.46)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Germany (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Workflow (0.46)
Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.55)

Add feedback