AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Efficient Large-Scale Face Clustering Using an Online Mixture of Gaussians

Montero, David, Aginako, Naiara, Sierra, Basilio, Nieto, Marcos

arXiv.org Artificial Intelligence, Mar-31-2021

In this work, we address the problem of large-scale online face clustering: given a continuous stream of unknown faces, create a database grouping the incoming faces by their identity. The database must be updated every time a new face arrives. In addition, the solution must be efficient, accurate and scalable. For this purpose, we present an online Gaussian mixture-based clustering method (OGMC). The key idea of this method is that an identity can be represented by more than one distribution or cluster. Using feature vectors (f-vectors) extracted from the incoming faces, OGMC generates clusters that may be connected to others depending on their proximity and their robustness. Every time a cluster is updated with a new sample, its connections are also updated. With this approach, we reduce the dependency of the clustering process on the order and the size of the incoming data, and we are able to deal with complex data distributions. Experimental results show that the proposed approach outperforms state-of-the-art clustering methods on large-scale face clustering benchmarks not only in accuracy, but also in efficiency and scalability.

algorithm, computer vision, experiment, (11 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.engappai.2022.105079

2103.17272

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Spain > Basque Country (0.05)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Deep adaptive fuzzy clustering for evolutionary unsupervised representation learning

Tan, Dayu, Huang, Zheng, Peng, Xin, Zhong, Weimin, Mahalec, Vladimir

arXiv.org Artificial Intelligence, Mar-31-2021

Cluster assignment of large and complex images is a crucial but challenging task in pattern recognition and computer vision. In this study, we explore the possibility of employing fuzzy clustering in a deep neural network framework. Thus, we present a novel evolutionary unsupervised learning representation model with iterative optimization. It implements the deep adaptive fuzzy clustering (DAFC) strategy, which learns a convolutional neural network classifier given only unlabeled data samples. DAFC consists of a deep feature quality-verifying model and a fuzzy clustering model, in which a deep feature representation learning loss function and embedded fuzzy clustering with weighted adaptive entropy are implemented. We join fuzzy clustering to the deep reconstruction model, in which fuzzy membership is utilized to represent a clear structure of deep cluster assignments and to jointly optimize the deep representation learning and clustering. Also, the joint model evaluates current clustering performance by inspecting whether the re-sampled data from the estimated bottleneck space have consistent clustering properties, in order to progressively improve the deep clustering model. Comprehensive experiments on a variety of datasets show that the proposed method obtains substantially better performance in both reconstruction and clustering quality than other state-of-the-art deep clustering methods, as demonstrated by in-depth analysis in the extensive experiments.
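To make the notion of fuzzy membership concrete, here is a minimal sketch of the membership rule from classical fuzzy c-means. It is not the DAFC model or its weighted adaptive entropy term; the function name `fuzzy_memberships`, the fuzzifier `m`, and the toy points are assumptions chosen for illustration.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Classical fuzzy c-means memberships of one point in every cluster.

    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), where d_j is the Euclidean
    distance from the point to center j and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to it
    # (shared equally if it coincides with several centers).
    zeros = [j for j, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]

# Toy example (hypothetical data): a point three times closer to the first center.
memberships = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
print(memberships)
```

Unlike a hard assignment, every point receives a degree of membership in each cluster, and the degrees sum to one.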

cluster assignment, dataset, representation, (15 more...)

arXiv.org Artificial Intelligence

2103.17086

Country:

North America > Canada > Ontario > Hamilton (0.28)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Clustering Custom Data Using the K-Means Algorithm -- Python

#artificialintelligence, Mar-30-2021, 03:15:11 GMT

The K-Means clustering algorithm is an unsupervised learning algorithm, meaning that it has no target labels. Choosing the best value of "K" is tricky, but one way of doing it is the elbow method. In this method, the sum of squared errors (SSE) is calculated for a range of "K" values. The SSE is the sum of the squared distances between each data point in a cluster and that cluster's centroid. The value of "K" at which the SSE stops dropping sharply (the "elbow" of the curve) is taken as a reasonable choice.
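The elbow method can be sketched in plain Python as follows. This is an illustrative sketch, not the article's code: the synthetic three-blob data, the farthest-point initialization, and the second-difference heuristic in `elbow_k` are all assumptions made so that the example is small and reproducible.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    # Deterministic farthest-point initialization (an assumption of this
    # sketch; libraries usually use randomized k-means++ seeding).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
    return centroids, labels

def sse(points, centroids, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def elbow_k(sse_by_k):
    """Pick the K with the largest second difference of the SSE curve."""
    ks = sorted(sse_by_k)
    return max(ks[1:-1], key=lambda k: sse_by_k[k - 1] - 2 * sse_by_k[k] + sse_by_k[k + 1])

# Hypothetical data: three well-separated 2-D blobs of 30 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blob_centers for _ in range(30)]

sse_by_k = {}
for k in range(1, 7):
    centroids, labels = kmeans(points, k)
    sse_by_k[k] = sse(points, centroids, labels)
    print(k, round(sse_by_k[k], 1))

best_k = elbow_k(sse_by_k)
print("elbow at K =", best_k)
```

In practice the SSE curve is usually plotted and inspected by eye; the second-difference rule is only one simple way to automate that judgment and can mislead when clusters overlap.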

algorithm, clustering custom data, k-means algorithm, (7 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.60)

Model-based clustering of partial records

Goren, Emily M., Maitra, Ranjan

arXiv.org Machine Learning, Mar-30-2021

In practice, real data sets may have missing values or otherwise have only partially observed records that complicate the application and validity of standard statistical methodology. Missingness may result from diverse causes, with an underlying mechanism of one of three types: missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR) [16]. Under MCAR, the probability that a case (record, sample, observation) is missing feature (variable, attribute, dimension) values does not depend on either the observed or missing feature values. When the probability that a case is missing feature values may depend on the observed feature values, but not the missing feature values, the mechanism is MAR. In the more extreme and challenging case of NMAR, the probability that a case is missing feature values depends on both observed and missing feature values. Notably, if the data are MCAR, they are also MAR; if the data are not MAR, then they are NMAR. Strategies for analysis of data with missing values are often critically dependent on the missingness mechanism, and clustering is no exception. For clustering problems, the most common (and often expedient) treatment of missing values is deletion, on either a case or feature basis, or imputation [17], [18].
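The three mechanisms can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not the paper's method: each case has an always-observed feature `x` and a possibly missing feature `y`, the two are drawn independently, and the thresholds and probabilities are arbitrary. In the NMAR variant here, missingness depends only on the missing value itself, which is one simple instance of a mechanism that is not MAR.

```python
import random

rng = random.Random(42)
n = 1000
# Each case is (x, y): x is always observed, y may be missing.
cases = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

def mcar_mask(cases, p=0.3):
    """MCAR: y is missing with a fixed probability, whatever x and y are."""
    return [rng.random() < p for _ in cases]

def mar_mask(cases, p=0.6):
    """MAR: missingness of y depends only on the observed x (here x > 0)."""
    return [x > 0 and rng.random() < p for x, _ in cases]

def nmar_mask(cases, p=0.6):
    """NMAR: missingness of y depends on the unobserved y itself (here y > 0)."""
    return [y > 0 and rng.random() < p for _, y in cases]

def observed_mean(cases, mask):
    """Mean of y after deleting the cases whose y is missing."""
    kept = [y for (_, y), missing in zip(cases, mask) if not missing]
    return sum(kept) / len(kept)

mcar, mar, nmar = mcar_mask(cases), mar_mask(cases), nmar_mask(cases)
mean_mcar = observed_mean(cases, mcar)
mean_mar = observed_mean(cases, mar)
mean_nmar = observed_mean(cases, nmar)
print("MCAR:", round(mean_mcar, 3), "MAR:", round(mean_mar, 3), "NMAR:", round(mean_nmar, 3))
```

The true mean of `y` is zero. Deleting incomplete cases leaves the estimate close to zero under MCAR (and under MAR in this toy setup, because `x` and `y` are independent), but pulls it clearly below zero under NMAR, which is why the choice of treatment depends on the mechanism.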

algorithm, iteration, missingness mechanism, (15 more...)

arXiv.org Machine Learning

2103.16336

Country:

North America > United States > Iowa (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > New York (0.04)
Asia > India (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine (1.00)
Food & Agriculture (0.93)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data

Aoyama, Kazuo, Saito, Kazumi

arXiv.org Machine Learning, Mar-30-2021

This paper presents an architecture-friendly k-means clustering algorithm called SIVF for large-scale, high-dimensional sparse data sets. The time efficiency of an algorithm is often measured by the number of costly operations such as similarity calculations. In practice, however, it depends greatly on how well the algorithm adapts to the architecture of the computer system on which it is executed. Our proposed SIVF employs an invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations between a data object and the centroids of all the clusters. To maximize the ICP performance, SIVF uses an inverted file for the centroid set that is structured so as to reduce pipeline hazards. We demonstrate in our experiments on real large-scale document data sets that SIVF operates at higher speed and with lower memory consumption than existing algorithms. Our performance analysis reveals that SIVF achieves the higher speed by suppressing performance degradation factors, namely the numbers of cache misses and branch mispredictions, rather than by performing fewer similarity calculations.
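The basic idea of computing similarities through an inverted file over the centroids can be sketched as follows. This is only the generic inverted-file technique for sparse vectors; it does not implement SIVF's ICP filter or its hazard-aware memory layout, and the names `build_inverted_file`, `similarities`, and `assign` as well as the toy term weights are assumptions.

```python
from collections import defaultdict

def build_inverted_file(centroids):
    """Map each term to the list of (centroid id, weight) pairs that contain it."""
    inv = defaultdict(list)
    for cid, centroid in enumerate(centroids):
        for term, weight in centroid.items():
            inv[term].append((cid, weight))
    return inv

def similarities(doc, inv, k):
    """Dot product of a sparse document with all k centroids at once."""
    sims = [0.0] * k
    for term, w_doc in doc.items():
        # Only centroids that actually contain the term are touched.
        for cid, w_cen in inv.get(term, ()):
            sims[cid] += w_doc * w_cen
    return sims

def assign(doc, inv, k):
    """Index of the most similar centroid."""
    sims = similarities(doc, inv, k)
    return max(range(k), key=sims.__getitem__)

# Hypothetical sparse vectors stored as {term: weight} dictionaries.
centroids = [
    {"cluster": 0.5, "graph": 1.0},
    {"face": 1.0, "stream": 0.5},
    {"cluster": 0.25, "face": 0.25, "sparse": 2.0},
]
doc = {"face": 2.0, "stream": 1.0, "cluster": 0.5}

inv = build_inverted_file(centroids)
sims = similarities(doc, inv, len(centroids))
print(sims, "-> assigned to centroid", assign(doc, inv, len(centroids)))
```

The work per document is proportional to the number of (term, centroid) pairs that share a term with the document, not to the full vocabulary size, which is what makes the structure attractive for high-dimensional sparse data.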

algorithm, feature vector, similarity calculation, (14 more...)

arXiv.org Machine Learning

2103.16141

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
North America > United States > New Jersey (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Multilayer Graph Clustering with Optimized Node Embedding

Gheche, Mireille El, Frossard, Pascal

arXiv.org Artificial Intelligence, Mar-30-2021

We are interested in multilayer graph clustering, which aims at dividing the graph nodes into categories or communities. To do so, we propose to learn a clustering-friendly embedding of the graph nodes by solving an optimization problem that involves a fidelity term to the layers of a given multilayer graph, and a regularization on the (single-layer) graph induced by the embedding. The fidelity term uses the contrastive loss to properly aggregate the observed layers into a representative embedding. The regularization pushes for a sparse and community-aware graph, and it is based on a measure of graph sparsification called "effective resistance", coupled with a penalization of the first few eigenvalues of the representative graph Laplacian matrix to favor the formation of communities. The proposed optimization problem is nonconvex but fully differentiable, and thus can be solved via the gradient descent method. Experiments show that our method leads to a significant improvement w.r.t. state-of-the-art multilayer graph clustering algorithms.

graph, laplacian matrix, multilayer graph, (15 more...)

arXiv.org Artificial Intelligence

2103.16534

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > United Kingdom > England > East Sussex > Brighton (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

pH-RL: A personalization architecture to bring reinforcement learning to health practice

Hassouni, Ali el, Hoogendoorn, Mark, Ciharova, Marketa, Kleiboer, Annet, Amarti, Khadicha, Muhonen, Vesa, Riper, Heleen, Eiben, A. E.

arXiv.org Artificial Intelligence, Mar-30-2021

While reinforcement learning (RL) has proven to be the approach of choice for tackling many complex problems, it remains challenging to develop and deploy RL agents in real-life scenarios successfully. This paper presents pH-RL (personalization in e-Health with RL), a general RL architecture for personalization to bring RL to health practice. pH-RL allows for various levels of personalization in health applications and supports both online and batch learning. Furthermore, we provide a general-purpose implementation framework that can be integrated with various healthcare applications. We describe a step-by-step guideline for the successful deployment of RL policies in a mobile application. We implemented our open-source RL architecture and integrated it with the MoodBuster mobile application for mental health to provide messages to increase daily adherence to the online therapeutic modules. We then performed a comprehensive study with human participants over a sustained period. Our experimental results show that the developed policies learn to select appropriate actions consistently using only a few days' worth of data. Furthermore, we empirically demonstrate the stability of the learned policies during the study.

application, architecture, personalization, (16 more...)

arXiv.org Artificial Intelligence

2103.15908

Country: Europe > Netherlands > North Holland > Amsterdam (0.05)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.46)

Industry:

Leisure & Entertainment > Games (0.93)
Health & Medicine > Health Care Technology (0.70)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

Dong, Yuyang, Takeoka, Kunihiro, Xiao, Chuan, Oyamada, Masafumi

arXiv.org Artificial Intelligence, Mar-29-2021

Finding joinable tables in data lakes is a key procedure in many applications such as data integration, data augmentation, data analysis, and data markets. Traditional approaches that find equi-joinable tables are unable to deal with misspellings and different formats, nor do they capture any semantic joins. In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. We embed textual values as high-dimensional vectors and join columns under similarity predicates on high-dimensional vectors, thereby addressing the limitations of equi-join approaches and identifying more meaningful results. To efficiently find joinable tables with similarity, we propose a block-and-verify method that utilizes pivot-based filtering. A partitioning technique is developed to cope with the case when the data lake is large and the index cannot fit in main memory. An experimental evaluation on real datasets shows that our solution identifies substantially more tables than equi-joins and outperforms other similarity-based options, and the join results are useful in data enrichment for machine learning tasks. The experiments also demonstrate the efficiency of the proposed method.
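Pivot-based filtering relies on the triangle inequality: for any pivot p, |d(q, p) - d(v, p)| is a lower bound on d(q, v), so a candidate whose bound already exceeds the similarity threshold can be skipped without computing its distance. The sketch below shows this generic filter-then-verify idea; it is not PEXESO's block-and-verify algorithm, and the function names, the 2-D stand-ins for embeddings, the pivots, and the threshold `tau` are all assumptions.

```python
import math

def build_pivot_table(vectors, pivots):
    """Precompute the distance from every indexed vector to every pivot."""
    return [[math.dist(v, p) for p in pivots] for v in vectors]

def range_search(query, vectors, pivots, table, tau):
    """Return (matches within distance tau, number of exact distances computed)."""
    q_to_pivots = [math.dist(query, p) for p in pivots]
    matches, verified = [], 0
    for v, row in zip(vectors, table):
        # Filter: if the lower bound from any pivot exceeds tau, v cannot match.
        if any(abs(dq - dv) > tau for dq, dv in zip(q_to_pivots, row)):
            continue
        # Verify: compute the exact distance only for surviving candidates.
        verified += 1
        if math.dist(query, v) <= tau:
            matches.append(v)
    return matches, verified

# Hypothetical embedded values (2-D for readability; real embeddings are high-dimensional).
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = build_pivot_table(vectors, pivots)

query, tau = (0.2, 0.1), 1.0
matches, verified = range_search(query, vectors, pivots, table, tau)
print(matches, "exact distance computations:", verified)
```

The filter never discards a true match, so the result equals that of a brute-force scan; the saving comes from replacing exact distance computations on high-dimensional vectors with a few scalar comparisons against precomputed pivot distances.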

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2010.13273

Country:

North America > United States > Alaska (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Multiscale Clustering of Hyperspectral Images Through Spectral-Spatial Diffusion Geometry

Polk, Sam L., Murphy, James M.

arXiv.org Machine Learning, Mar-29-2021

Clustering algorithms partition a dataset into groups of similar points. The primary contribution of this article is the Multiscale Spatially-Regularized Diffusion Learning (M-SRDL) clustering algorithm, which uses spatially-regularized diffusion distances to efficiently and accurately learn multiple scales of latent structure in hyperspectral images (HSI). The M-SRDL clustering algorithm extracts clusterings at many scales from an HSI and outputs these clusterings' variation of information-barycenter as an exemplar for all underlying cluster structure. We show that incorporating spatial regularization into a multiscale clustering framework corresponds to smoother and more coherent clusters when applied to HSI data and leads to more accurate clustering labels.
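The variation of information (VI) between two clusterings can be computed directly from label counts, and an exemplar can then be chosen as the clustering with the smallest total VI to the others. The sketch below assumes that the barycenter is selected among the extracted clusterings themselves; that reading, the function names, and the toy label vectors are assumptions rather than details confirmed by the abstract.

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A | B) + H(B | A), computed with natural logarithms."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (math.log(p_ab / (count_a[a] / n)) + math.log(p_ab / (count_b[b] / n)))
    return vi

def vi_exemplar(clusterings):
    """Clustering with the smallest summed VI to all clusterings in the list."""
    return min(clusterings, key=lambda c: sum(variation_of_information(c, o) for o in clusterings))

# Hypothetical clusterings of six points at three scales (coarse to fine).
coarse = [0, 0, 0, 0, 0, 0]
medium = [0, 0, 0, 1, 1, 1]
fine = [0, 0, 1, 2, 2, 3]

exemplar = vi_exemplar([coarse, medium, fine])
print("exemplar:", exemplar)
print("VI(medium, fine) =", round(variation_of_information(medium, fine), 4))
```

VI is zero exactly when two clusterings agree up to a renaming of labels, and it is symmetric, which makes it a convenient distance for comparing clusterings extracted at different scales.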

diffusion distance, m-srdl, spatially-regularized diffusion learning, (10 more...)

arXiv.org Machine Learning

2103.15783

Country:

North America > United States > Massachusetts > Middlesex County > Medford (0.04)
North America > United States > California (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Fully Explained BIRCH Clustering for Outliers with Python

#artificialintelligence, Mar-28-2021, 12:50:11 GMT

This algorithm is used to perform hierarchical clustering based on trees. These trees are called Clustering Feature Trees (CFT). The full form of BIRCH is Balanced Iterative Reducing and Clustering using Hierarchies. The metric used to measure distance in this clustering is the Euclidean distance. When the dataset is massive and BIRCH cannot meet the requirement because of the memory constraints of using the whole dataset, we should consider using fixed-size mini-batches drawn from the dataset to reduce runtime.
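The summaries stored in a BIRCH tree are clustering features: the triple (N, LS, SS) of point count, linear sum, and sum of squared norms, from which a centroid and radius follow without keeping the points themselves. The sketch below keeps a flat list of such summaries and feeds it mini-batches; it is a simplified illustration, not a full CF tree with branching and node splitting, and the class `CF`, the function `partial_fit`, the threshold, and the data are assumptions.

```python
import math

class CF:
    """Clustering feature: count N, linear sum LS, and sum of squared norms SS."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = 0.0

    def add(self, point):
        """Absorb one point by updating the three running sums."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(v * v for v in point)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius_if_added(self, point):
        """Radius (root mean squared distance to the centroid) after adding point."""
        n = self.n + 1
        ls = [a + b for a, b in zip(self.ls, point)]
        ss = self.ss + sum(v * v for v in point)
        variance = ss / n - sum((v / n) ** 2 for v in ls)
        return math.sqrt(max(variance, 0.0))

def partial_fit(entries, batch, threshold):
    """Insert one mini-batch into the list of clustering features."""
    for point in batch:
        if entries:
            # Euclidean distance to each summary's centroid picks the candidate.
            nearest = min(entries, key=lambda cf: math.dist(cf.centroid(), point))
            if nearest.radius_if_added(point) <= threshold:
                nearest.add(point)
                continue
        cf = CF(len(point))
        cf.add(point)
        entries.append(cf)
    return entries

# Hypothetical data processed in fixed-size mini-batches of three points.
data = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (9.6, 10.3),
        (0.1, -0.4), (10.2, 9.9), (-0.3, 0.3), (10.1, 10.4)]
entries = []
for start in range(0, len(data), 3):
    partial_fit(entries, data[start:start + 3], threshold=1.0)

for cf in entries:
    print(cf.n, [round(v, 3) for v in cf.centroid()])
```

Because each summary is updated by simple addition, only the current mini-batch and the summaries need to be in memory at any time. scikit-learn's `Birch` estimator exposes the same incremental pattern through its `partial_fit` method.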

algorithm, birch, explained birch clustering, (3 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.62)
