Goto

Collaborating Authors

 Clustering


Efficient Large-Scale Face Clustering Using an Online Mixture of Gaussians

arXiv.org Artificial Intelligence

In this work, we address the problem of large-scale online face clustering: given a continuous stream of unknown faces, create a database grouping the incoming faces by their identity. The database must be updated every time a new face arrives. In addition, the solution must be efficient, accurate and scalable. For this purpose, we present an online gaussian mixture-based clustering method (OGMC). The key idea of this method is the proposal that an identity can be represented by more than just one distribution or cluster. Using feature vectors (f-vectors) extracted from the incoming faces, OGMC generates clusters that may be connected to others depending on their proximity and their robustness. Every time a cluster is updated with a new sample, its connections are also updated. With this approach, we reduce the dependency of the clustering process on the order and the size of the incoming data and we are able to deal with complex data distributions. Experimental results show that the proposed approach outperforms state-of-the-art clustering methods on large-scale face clustering benchmarks not only in accuracy, but also in efficiency and scalability.


Deep adaptive fuzzy clustering for evolutionary unsupervised representation learning

arXiv.org Artificial Intelligence

Cluster assignment of large and complex images is a crucial but challenging task in pattern recognition and computer vision. In this study, we explore the possibility of employing fuzzy clustering in a deep neural network framework. Thus, we present a novel evolutionary unsupervised learning representation model with iterative optimization. It implements the deep adaptive fuzzy clustering (DAFC) strategy that learns a convolutional neural network classifier from given only unlabeled data samples. DAFC consists of a deep feature quality-verifying model and a fuzzy clustering model, where deep feature representation learning loss function and embedded fuzzy clustering with the weighted adaptive entropy is implemented. We joint fuzzy clustering to the deep reconstruction model, in which fuzzy membership is utilized to represent a clear structure of deep cluster assignments and jointly optimize for the deep representation learning and clustering. Also, the joint model evaluates current clustering performance by inspecting whether the re-sampled data from estimated bottleneck space have consistent clustering properties to progressively improve the deep clustering model. Comprehensive experiments on a variety of datasets show that the proposed method obtains a substantially better performance for both reconstruction and clustering quality when compared to the other state-of-the-art deep clustering methods, as demonstrated with the in-depth analysis in the extensive experiments.


Clustering Custom Data Using the K-Means Algorithm -- Python

#artificialintelligence

The K-Means clustering algorithm is an unsupervised learning algorithm meaning that it has no target labels. It is very tricky to choose the best "K" value. But one way of doing it is the elbow method. According to this method, the sum of squared error (SSE) is calculated for some values of "K". The SSE is the sum of the squared distance between each data point of cluster and its centroid.


Model-based clustering of partial records

arXiv.org Machine Learning

In practice, real data sets may have missing values or otherwise have only partially observed records that complicate the validity and application validity of standard statistical methodology. Missingness may result from diverse causes, with an underlying mechanism of one of three types: missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR) [16]. Under MCAR, the probability that a case (record, sample, observation) is missing feature (variable, attribute, dimension) values does not depend on either the observed or missing feature values. When the probability that a case is missing feature values may depend on the observed feature values, but not the missing feature values, the mechanism is MAR. In the more extreme and challenging case of NMAR, the probability that a case is missing feature values depends on both observed and missing feature values. Notably, if the data are MCAR, they are also MAR; if the data are not MAR, then they are NMAR. Strategies for analysis of data with missing values are often critically dependent on the missingness mechanism, and clustering is no exception. For clustering problems, the most common (and often expedient) treatment of missing values is deletion, on either a case or feature basis, or imputation [17], [18].


Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data

arXiv.org Machine Learning

This paper presents an architecture-friendly k-means clustering algorithm called SIVF for a large-scale and high-dimensional sparse data set. Algorithm efficiency on time is often measured by the number of costly operations such as similarity calculations. In practice, however, it depends greatly on how the algorithm adapts to an architecture of the computer system which it is executed on. Our proposed SIVF employs invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations between a data object and centroids of all the clusters. To maximize the ICP performance, SIVF exploits for a centroid set an inverted-file that is structured so as to reduce pipeline hazards. We demonstrate in our experiments on real large-scale document data sets that SIVF operates at higher speed and with lower memory consumption than existing algorithms. Our performance analysis reveals that SIVF achieves the higher speed by suppressing performance degradation factors of the number of cache misses and branch mispredictions rather than less similarity calculations.


Multilayer Graph Clustering with Optimized Node Embedding

arXiv.org Artificial Intelligence

We are interested in multilayer graph clustering, which aims at dividing the graph nodes into categories or communities. To do so, we propose to learn a clustering-friendly embedding of the graph nodes by solving an optimization problem that involves a fidelity term to the layers of a given multilayer graph, and a regularization on the (single-layer) graph induced by the embedding. The fidelity term uses the contrastive loss to properly aggregate the observed layers into a representative embedding. The regularization pushes for a sparse and community-aware graph, and it is based on a measure of graph sparsification called "effective resistance", coupled with a penalization of the first few eigenvalues of the representative graph Laplacian matrix to favor the formation of communities. The proposed optimization problem is nonconvex but fully differentiable, and thus can be solved via the descent gradient method. Experiments show that our method leads to a significant improvement w.r.t. state-of-the-art multilayer graph clustering algorithms.


pH-RL: A personalization architecture to bring reinforcement learning to health practice

arXiv.org Artificial Intelligence

While reinforcement learning (RL) has proven to be the approach of choice for tackling many complex problems, it remains challenging to develop and deploy RL agents in real-life scenarios successfully. This paper presents pH-RL (personalization in e-Health with RL) a general RL architecture for personalization to bring RL to health practice. pH-RL allows for various levels of personalization in health applications and allows for online and batch learning. Furthermore, we provide a general-purpose implementation framework that can be integrated with various healthcare applications. We describe a step-by-step guideline for the successful deployment of RL policies in a mobile application. We implemented our open-source RL architecture and integrated it with the MoodBuster mobile application for mental health to provide messages to increase daily adherence to the online therapeutic modules. We then performed a comprehensive study with human participants over a sustained period. Our experimental results show that the developed policies learn to select appropriate actions consistently using only a few days' worth of data. Furthermore, we empirically demonstrate the stability of the learned policies during the study.


Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

arXiv.org Artificial Intelligence

Finding joinable tables in data lakes is key procedure in many applications such as data integration, data augmentation, data analysis, and data market. Traditional approaches that find equi-joinable tables are unable to deal with misspellings and different formats, nor do they capture any semantic joins. In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. We embed textual values as high-dimensional vectors and join columns under similarity predicates on high-dimensional vectors, hence to address the limitations of equi-join approaches and identify more meaningful results. To efficiently find joinable tables with similarity, we propose a block-and-verify method that utilizes pivot-based filtering. A partitioning technique is developed to cope with the case when the data lake is large and the index cannot fit in main memory. An experimental evaluation on real datasets shows that our solution identifies substantially more tables than equi-joins and outperforms other similarity-based options, and the join results are useful in data enrichment for machine learning tasks. The experiments also demonstrate the efficiency of the proposed method.


Multiscale Clustering of Hyperspectral Images Through Spectral-Spatial Diffusion Geometry

arXiv.org Machine Learning

Clustering algorithms partition a dataset into groups of similar points. The primary contribution of this article is the Multiscale Spatially-Regularized Diffusion Learning (M-SRDL) clustering algorithm, which uses spatially-regularized diffusion distances to efficiently and accurately learn multiple scales of latent structure in hyperspectral images (HSI). The M-SRDL clustering algorithm extracts clusterings at many scales from an HSI and outputs these clusterings' variation of information-barycenter as an exemplar for all underlying cluster structure. We show that incorporating spatial regularization into a multiscale clustering framework corresponds to smoother and more coherent clusters when applied to HSI data and leads to more accurate clustering labels.


Fully Explained BIRCH Clustering for Outliers with Python

#artificialintelligence

This algorithm is used to perform hierarchical clustering based on trees. These trees are called CFT i.e. The full form of BIRCH is Balanced Iterative Reducing Clusters using Hierarchies. The metric use in this cluster to measure the distance is Euclidean distance measurement. When we get a massive dataset and BIRCH is not fulfilling the requirement because of memory constraints of using the whole dataset then we should consider mini-batches of fixed size from the dataset to get reduced runtime.