AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

An Efficient Framework for Clustered Federated Learning

Ghosh, Avishek, Chung, Jichan, Yin, Dong, Ramchandran, Kannan

arXiv.org Machine LearningJun-7-2020

We address the problem of Federated Learning (FL) where users are distributed and partitioned into clusters. This setup captures settings where different groups of users have their own objectives (learning tasks) but by aggregating their data with others in the same cluster (same learning task), they can leverage the strength in numbers in order to perform more efficient Federated Learning. We propose a new framework dubbed the Iterative Federated Clustering Algorithm (IFCA), which alternately estimates the cluster identities of the users and optimizes model parameters for the user clusters via gradient descent. We analyze the convergence rate of this algorithm first in a linear model with squared loss and then for generic strongly convex and smooth loss functions. We show that in both settings, with good initialization, IFCA converges at an exponential rate, and discuss the optimality of the statistical error rate. In the experiments, we show that our algorithm can succeed even if we relax the requirements on initialization with random initialization and multiple restarts. We also present experimental results showing that our algorithm is efficient in non-convex problems such as neural networks and outperforms the baselines on several clustered FL benchmarks created based on the MNIST and CIFAR-10 datasets by $5\sim 8\%$.

artificial intelligence, machine learning, probability, (15 more...)

arXiv.org Machine Learning

2006.04088

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Hybrid Model for Anomaly Detection on Call Detail Records by Time Series Forecasting

Mokhtari, Aryan, Sadighi, Leyla, Bahrak, Behnam, Eshghie, Mojtaba

arXiv.org Machine LearningJun-7-2020

Mobile network operators store an enormous amount of information like log files that describe various events and users' activities. Analysis of these logs might be used in many critical applications such as detecting cyber-attacks, finding behavioral patterns of users, security incident response, network forensics, etc. In a cellular network Call Detail Records (CDR) is one type of such logs containing metadata of calls and usually includes valuable information about contact such as the phone numbers of originating and receiving subscribers, call duration, the area of activity, type of call (SMS or voice call) and a timestamp. With anomaly detection, it is possible to determine abnormal reduction or increment of network traffic in an area or for a particular person. This paper's primary goal is to study subscribers' behavior in a cellular network, mainly predicting the number of calls in a region and detecting anomalies in the network traffic. In this paper, a new hybrid method is proposed based on various anomaly detection methods such as GARCH, K-means, and Neural Network to determine the anomalous data. Moreover, we have discussed the possible causes of such anomalies.

anomaly, anomaly detection, detection, (12 more...)

arXiv.org Machine Learning

2006.04101

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.06)
North America > United States > New Jersey (0.04)
Europe > Sweden (0.04)
(2 more...)

Genre: Research Report (0.85)

Industry:

Telecommunications (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Learning pose variations within shape population by constrained mixtures of factor analyzers

Wang, Xilu

arXiv.org Machine LearningJun-7-2020

Mining and learning the shape variability of underlying population has benefited the applications including parametric shape modeling, 3D animation, and image segmentation. The current statistical shape modeling method works well on learning unstructured shape variations without obvious pose changes (relative rotations of the body parts). Studying the pose variations within a shape population involves segmenting the shapes into different articulated parts and learning the transformations of the segmented parts. This paper formulates the pose learning problem as mixtures of factor analyzers. The segmentation is obtained by components posterior probabilities and the rotations in pose variations are learned by the factor loading matrices. To guarantee that the factor loading matrices are composed by rotation matrices, constraints are imposed and the corresponding closed form optimal solution is derived. Based on the proposed method, the pose variations are automatically learned from the given shape populations. The method is applied in motion animation where new poses are generated by interpolating the existing poses in the training set. The obtained results are smooth and realistic.

artificial intelligence, factor analyzer, machine learning, (18 more...)

arXiv.org Machine Learning

2006.04171

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > South Carolina (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (0.64)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Machine Learning is Stocks. Pre-Processing for Unsupervised Company Classification.

#artificialintelligenceJun-6-2020, 15:21:30 GMT

This article the idea of building a new, data driven classification of companies based on their financials, instead of the type of business they do. The premise of this study is based on the idea that great stocks to buy would be superior among their peers. In this context, I define a peer group as a group of companies that have similar financial structure (e.g. If you haven't read an introduction, I suggest you read my preface to this study here: In the previous article, I've selected 5 dimensions to define financial structure of a company: I've also made sure that these dimensions are independent. This is an important consideration since I want to measure how a company is doing at each independent front regardless of its performance elsewhere.

algorithm, artificial intelligence, machine learning, (11 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.52)

Add feedback

An Efficient $k$-modes Algorithm for Clustering Categorical Datasets

Dorman, Karin S., Maitra, Ranjan

arXiv.org Machine LearningJun-6-2020

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient, distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. We provide a novel, computationally efficient implementation of k-modes, called OTQT. We prove that OTQT finds updates, undetectable to existing k-modes algorithms, that improve the objective function. Thus, although slightly slower per iteration owing to its algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. As a result, we recommend OTQT as the preferred, default algorithm for all k-modes implementations. We also examine five initialization methods and three types of K-selection methods, many of them novel or novel applications to k-modes. By examining performance on real and simulated datasets, we show that simple random initialization is the best initializer and that a novel K-selection method is more accurate than methods adapted from k-means. Identifying groups of similar observations in datasets is common in a wide array of applications, with many clustering methods developed in statistics, machine learning and the applied sciences [1]-[7]. The k-means algorithm [8]-[11] is arguably the most popular method for clustering numerical-valued observations. It scales to large datasets because it does not require calculation of all pairwise distances, and it is distribution-free. While distribution-free does not imply it is assumption-free [12], [13], it is a starting place for users wary of making assumptions about their data. Unfortunately, k-means does not provide an appropriate objective to minimize for datasets with categorical attributes.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2006.03936

Country:

North America > United States > Iowa > Story County > Ames (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Virginia > Fairfax County > McLean (0.04)
(11 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.45)

Industry:

Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Food & Agriculture (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Learning Inconsistent Preferences with Kernel Methods

Chau, Siu Lun, González, Javier, Sejdinovic, Dino

arXiv.org Machine LearningJun-6-2020

We propose a probabilistic kernel approach for preferential learning from pairwise duelling data using Gaussian Processes. Different from previous methods, we do not impose a total order on the item space, hence can capture more expressive latent preferential structures such as inconsistent preferences and clusters of comparable items. Furthermore, we prove the universality of the proposed kernels, i.e. that the corresponding reproducing kernel Hilbert Space (RKHS) is dense in the space of skew-symmetric preference functions. To conclude the paper, we provide an extensive set of numerical experiments on simulated and real-world datasets showcasing the competitiveness of our proposed method with state-of-the-art.

artificial intelligence, kernel, machine learning, (16 more...)

arXiv.org Machine Learning

2006.03847

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.46)

Industry: Leisure & Entertainment > Sports > Football (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Improving k-Means Clustering Performance with Disentangled Internal Representations

Agarap, Abien Fred, Azcarraga, Arnulfo P.

arXiv.org Machine LearningJun-5-2020

Deep clustering algorithms combine representation learning and clustering by jointly optimizing a clustering loss and a non-clustering loss. In such methods, a deep neural network is used for representation learning together with a clustering network. Instead of following this framework to improve clustering performance, we propose a simpler approach of optimizing the entanglement of the learned latent code representation of an autoencoder. We define entanglement as how close pairs of points from the same class or structure are, relative to pairs of points from different classes or structures. To measure the entanglement of data points, we use the soft nearest neighbor loss, and expand it by introducing an annealing temperature factor. Using our proposed approach, the test clustering accuracy was 96.2% on the MNIST dataset, 85.6% on the Fashion-MNIST dataset, and 79.2% on the EMNIST Balanced dataset, outperforming our baseline models.

artificial intelligence, machine learning, representation, (15 more...)

arXiv.org Machine Learning

2006.04535

Country: Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

Handling missing data in model-based clustering

Serafini, Alessio, Murphy, Thomas Brendan, Scrucca, Luca

arXiv.org Machine LearningJun-4-2020

Gaussian Mixture models (GMMs) are a powerful tool for clustering, classification and density estimation when clustering structures are embedded in the data. The presence of missing values can largely impact the GMMs estimation process, thus handling missing data turns out to be a crucial point in clustering, classification and density estimation. Several techniques have been developed to impute the missing values before model estimation. Among these, multiple imputation is a simple and useful general approach to handle missing data. In this paper we propose two different methods to fit Gaussian mixtures in the presence of missing data. Both methods use a variant of the Monte Carlo Expectation-Maximisation (MCEM) algorithm for data augmentation. Thus, multiple imputations are performed during the E-step, followed by the standard M-step for a given eigen-decomposed component-covariance matrix. We show that the proposed methods outperform the multiple imputation approach, both in terms of clusters identification and density estimation.

artificial intelligence, data quality, machine learning, (17 more...)

arXiv.org Machine Learning

2006.02954

Country:

Europe > Austria > Vienna (0.14)
Asia > Middle East > Jordan (0.05)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.64)

Add feedback

An Alternative Metric for Detecting Anomalous Ship Behavior Using a Variation of the DBSCAN Clustering Algorithm

Botts, Carsten

arXiv.org Machine LearningJun-2-2020

There is a growing need to quickly and accurately identify anomalous behavior in ships. This paper applies a variation of the Density Based Spatial Clustering Among Noise (DBSCAN) algorithm to identify such anomalous behavior given a ship's Automatic Identification System (AIS) data. This variation of the DBSCAN algorithm has been previously introduced in the literature, and in this study, we elucidate and explore the mathematical details of this algorithm and introduce an alternative anomaly metric which is more statistically informative than the one previously suggested.

algorithm, anom bott, anom liu, (12 more...)

arXiv.org Machine Learning

2006.01936

Country:

North America > United States > New York > Richmond County > New York City (0.04)
North America > United States > New York > Queens County > New York City (0.04)
North America > United States > New York > New York County > New York City (0.04)
(6 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.64)

Add feedback

Inductive Geometric Matrix Midranges

Van Goffrier, Graham W., Mostajeran, Cyrus, Sepulchre, Rodolphe

arXiv.org Machine LearningJun-2-2020

Covariance data as represented by symmetric positive definite (SPD) matrices are ubiquitous throughout technical study as efficient descriptors of interdependent systems. Euclidean analysis of SPD matrices, while computationally fast, can lead to skewed and even unphysical interpretations of data. Riemannian methods preserve the geometric structure of SPD data at the cost of expensive eigenvalue computations. In this paper, we propose a geometric method for unsupervised clustering of SPD data based on the Thompson metric. This technique relies upon a novel "inductive midrange" centroid computation for SPD data, whose properties are examined and numerically confirmed. We demonstrate the incorporation of the Thompson metric and inductive midrange into X-means and K-means++ clustering algorithms.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2006.01508

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback