AITopics | Clustering


Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)


K-Means Clustering in R: Step-by-Step Example

#artificialintelligence, Aug-23-2021, 15:41:44 GMT

Clustering is a technique in machine learning that attempts to find clusters of observations within a dataset. The goal is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other. Clustering is a form of unsupervised learning because we're simply attempting to find structure within a dataset rather than predicting the value of some response variable. For example, when information about households is available, clustering can be used to identify households that are similar and may be more likely to purchase certain products or respond better to a certain type of advertising. One of the most common forms of clustering is known as k-means clustering.
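To make the idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is an illustration only, not the article's R code: the function name `kmeans`, the toy data, and the choice to initialise centroids from the first k observations are all assumptions made for this example. Practical implementations usually use random or k-means++ initialisation.

```python
def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption for this sketch: start from the first k observations.
    centroids = list(points[:k])
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups, interleaved.
data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
labels, centroids = kmeans(data, k=2)
```

The two steps inside the loop are the whole algorithm: assign each observation to the closest centroid, then recompute each centroid as the mean of its cluster, and repeat until nothing moves.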

clustering, dataset, gap statistic, (8 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Beginners guide to k-Means Clustering - Analytics Vidhya

#artificialintelligence, Aug-23-2021, 12:10:44 GMT

The very first clustering algorithm that most people get exposed to is k-Means clustering. This is probably because it is very simple to understand; however, it has several disadvantages, which I will mention later. Clustering is generally viewed as an unsupervised method, so it is difficult to establish a good performance metric. However, a lot of useful information can be extracted with this algorithm. The problem is how to assign semantics to each cluster and thus measure the "performance" of your algorithm.

algorithm, artificial intelligence, machine learning, (4 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Flexible Clustered Federated Learning for Client-Level Data Distribution Shift

Duan, Moming, Liu, Duo, Ji, Xinyuan, Wu, Yu, Liang, Liang, Chen, Xianzhang, Tan, Yujuan

arXiv.org Artificial Intelligence, Aug-22-2021

Federated Learning (FL) enables multiple participating devices to collaboratively contribute to a global neural network model while keeping the training data local. Unlike the centralized training setting, the non-IID, imbalanced (statistical heterogeneity) and distribution shifted training data of FL is distributed in the federated network, which will increase the divergences between the local models and the global model, further degrading performance. In this paper, we propose a flexible clustered federated learning (CFL) framework named FlexCFL, in which we 1) group the training of clients based on the similarities between the clients' optimization directions for lower training divergence; 2) implement an efficient newcomer device cold start mechanism for framework scalability and practicality; 3) flexibly migrate clients to meet the challenge of client-level data distribution shift. FlexCFL can achieve improvements by dividing joint optimization into groups of sub-optimization and can strike a balance between accuracy and communication efficiency in the distribution shift environment. The convergence and complexity are analyzed to demonstrate the efficiency of FlexCFL. We also evaluate FlexCFL on several open datasets and make comparisons with related CFL frameworks. The results show that FlexCFL can significantly improve absolute test accuracy by +10.6% on FEMNIST compared to FedAvg, +3.5% on FashionMNIST compared to FedProx, and +8.4% on MNIST compared to FeSEM. The experiment results show that FlexCFL is also communication efficient in the distribution shift environment.

accuracy, distribution shift, flexcfl, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TPDS.2021.3134263

2108.09749

Country: Asia > China > Chongqing Province > Chongqing (0.04)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)


Rainfall-runoff prediction using a Gustafson-Kessel clustering based Takagi-Sugeno Fuzzy model

Dey, Subhrasankha, Dam, Tanmoy

arXiv.org Artificial Intelligence, Aug-22-2021

A rainfall-runoff model predicts surface runoff either using a physically-based approach or using a systems-based approach. Takagi-Sugeno (TS) Fuzzy models are systems-based approaches and a popular modeling choice for hydrologists in recent decades due to several advantages and improved accuracy in prediction over other existing models. In this paper, we propose a new rainfall-runoff model developed using a Gustafson-Kessel (GK) clustering-based TS Fuzzy model. We present comparative performance measures of GK algorithms with two other clustering algorithms: (i) Fuzzy C-Means (FCM), and (ii) Subtractive Clustering (SC). Our proposed TS Fuzzy model predicts surface runoff using: (i) observed rainfall in a drainage basin and (ii) previously observed precipitation flow in the basin outlet. The proposed model is validated using the rainfall-runoff data collected from the sensors installed on the campus of the Indian Institute of Technology, Kharagpur. The optimal number of rules of the proposed model is obtained by different validation indices. A comparative study of four performance criteria, Root Mean Square Error (RMSE), Coefficient of Efficiency (CE), Volumetric Error (VE), and Correlation Coefficient of Determination (R), has been quantitatively demonstrated for each clustering algorithm.
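The abstract names its performance criteria without giving formulas. As a rough sketch, two of them could be computed as follows, assuming the standard definition of RMSE and assuming that CE refers to the Nash-Sutcliffe coefficient commonly used in hydrology (an assumption; the abstract does not define it). The function names and the sample runoff values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def coefficient_of_efficiency(observed, predicted):
    """Nash-Sutcliffe form (assumed): 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical runoff values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
error = rmse(observed, predicted)
efficiency = coefficient_of_efficiency(observed, predicted)
```

Under this definition a CE of 1 means a perfect fit, and a CE of 0 means the model does no better than predicting the observed mean.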

algorithm, fuzzy model, optimal number, (11 more...)

arXiv.org Artificial Intelligence

2108.09684

Country:

Asia > India > West Bengal > Kharagpur (0.24)
Oceania > Australia > New South Wales (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Mastering Clustering with a Segmentation Problem - KDnuggets

#artificialintelligence, Aug-21-2021, 03:53:24 GMT

In the current age, the availability of granular data for a large pool of customers/products and the technological capability to handle petabytes of data efficiently are growing rapidly. Due to this, it's now possible to come up with very strategic and meaningful clusters for effective targeting. Identifying the target segments requires a robust segmentation exercise. In this blog, we will discuss the most popular unsupervised clustering algorithms and how to implement them in Python. We will be working with clickstream data from an online store offering clothing for pregnant women.

calinski harabasz score, optimal number, silhouette score, (9 more...)

#artificialintelligence

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Towards Personalized and Human-in-the-Loop Document Summarization

Ghodratnama, Samira

arXiv.org Artificial Intelligence, Aug-21-2021

The ubiquitous availability of computing devices and the widespread use of the internet continuously generate a large amount of data. Therefore, the amount of available information on any given topic is far beyond humans' capacity to process properly, causing what is known as information overload. To efficiently cope with large amounts of information and generate content with significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. This thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by (i) enabling automatic intelligent feature engineering, (ii) enabling flexible and interactive summarisation, and (iii) utilising intelligent and personalised summarisation approaches. The experimental results demonstrate the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data.

automatic intelligent feature engineering, computational natural language learning, iot-enabled process data analytic pipeline, (12 more...)

arXiv.org Artificial Intelligence

2108.09443

Country:

Oceania > Australia > New South Wales > Sydney (0.14)
Europe > Czechia > Prague (0.04)
North America > United States > New York (0.04)
(22 more...)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Education (0.92)
(7 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Web (1.00)
Information Technology > Communications > Social Media (1.00)
(16 more...)


Combining K-means type algorithms with Hill Climbing for Joint Stratification and Sample Allocation Designs

O'Luing, Mervyn, Prestwich, Steven, Tarim, S. Armagan

arXiv.org Machine Learning, Aug-18-2021

In this paper we combine the k-means and/or k-means type algorithms with a hill climbing algorithm in stages to solve the joint stratification and sample allocation problem. This is a combinatorial optimisation problem in which we search for the optimal stratification from the set of all possible stratifications of basic strata. Each stratification is a solution, the quality of which is measured by its cost. This problem is intractable for larger sets. Furthermore, evaluating the cost of each solution is expensive. A number of heuristic algorithms have already been developed to solve this problem with the aim of finding acceptable solutions in reasonable computation times. However, the heuristics for these algorithms need to be trained in order to optimise performance in each instance. We compare the above multi-stage combination of algorithms with three recent algorithms and report the solution costs, evaluation times and training times. The multi-stage combinations generally compare well with the recent algorithms, both in the case of atomic and continuous strata, and provide the survey designer with a greater choice of algorithms.

algorithm, sample size, strata, (12 more...)

arXiv.org Machine Learning

2108.08038

Country:

Europe > Austria > Vienna (0.14)
Europe > Ireland > Munster > County Cork > Cork (0.04)
North America > United States > New York (0.04)
North America > United States > California > Alameda County > Oakland (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Centroid Neural Network: An Efficient and Stable Clustering Algorithm

#artificialintelligence, Aug-17-2021, 00:10:18 GMT

Generally, clustering is grouping multi-dimensional datasets into closely related groups. Classical representatives of clustering algorithms are K-means Clustering and Self-Organizing Map (SOM). You can easily find numerous resources explaining those algorithms. This time, let me introduce another efficient clustering algorithm that seemingly few researchers pay attention to: Centroid Neural Network for Unsupervised Competitive Learning. See the original paper for a closer look.

algorithm, centroid neural network, efficient and stable clustering algorithm, (10 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Fully Explained OPTICS Clustering with Python Example

#artificialintelligence, Aug-17-2021, 00:10:08 GMT

Clustering is a powerful unsupervised knowledge discovery tool used to segment data points into groups with similar features. However, each clustering algorithm depends on its parameters. Among similarity-based techniques, the K-means algorithm groups data points by similarity and requires the user to specify how many clusters there are, while hierarchical clustering algorithms require the user to decide manually when the clusters are finished. A commonly used density-based clustering technique is DBSCAN, which requires two parameters that define its core points, but finding good values for these parameters is an extremely difficult task. A relative of DBSCAN is the OPTICS algorithm (Ordering Points To Identify the Clustering Structure).
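The two DBSCAN parameters mentioned above are conventionally a neighbourhood radius and a minimum neighbour count. The sketch below shows only how they define core points, not the full DBSCAN or OPTICS procedure. The names `eps` and `min_pts`, the function `core_points`, and the sample data are assumptions for illustration, as is the convention of counting a point as its own neighbour.

```python
import math


def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts neighbours within radius eps.

    Assumed convention: a point counts as a neighbour of itself.
    """
    core = []
    for i, p in enumerate(points):
        # Count every point (including p) within distance eps of p.
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_pts:
            core.append(i)
    return core


# Hypothetical data: three nearby points and one isolated point.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dense = core_points(pts, eps=0.5, min_pts=3)
```

Changing `eps` or `min_pts` changes which points qualify as core, which is why the result is so sensitive to these two values.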

dbscan, optic, reachability distance, (13 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Clustering dynamics on graphs: from spectral clustering to mean shift through Fokker-Planck interpolation

Craig, Katy, Trillos, Nicolás García, Slepčev, Dejan

arXiv.org Machine Learning, Aug-17-2021

In this work we build a unifying framework to interpolate between density-driven and geometry-based algorithms for data clustering, and specifically, to connect the mean shift algorithm with spectral clustering at discrete and continuum levels. We seek this connection through the introduction of Fokker-Planck equations on data graphs. Besides introducing new forms of mean shift algorithms on graphs, we provide new theoretical insights on the behavior of the family of diffusion maps in the large sample limit as well as provide new connections between diffusion maps and mean shift dynamics on a fixed graph. Several numerical examples illustrate our theoretical findings and highlight the benefits of interpolating density-driven and geometry-based clustering algorithms.
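For readers unfamiliar with the density-driven side of this comparison, the following is a minimal sketch of the classical mean shift iteration on one-dimensional data with a Gaussian kernel. It is not the graph-based or Fokker-Planck formulation proposed in the paper; the function name, the bandwidth value, and the data are assumptions chosen for illustration.

```python
import math


def mean_shift_point(x, data, bandwidth, n_iter=50):
    """Move x repeatedly to the kernel-weighted mean of the data around it."""
    for _ in range(n_iter):
        # Gaussian kernel weights: nearby samples pull harder than distant ones.
        weights = [
            math.exp(-((x - d) ** 2) / (2 * bandwidth ** 2)) for d in data
        ]
        x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
    return x


# Hypothetical data with two dense regions, near 1 and near 5.
data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
modes = [mean_shift_point(d, data, bandwidth=0.5) for d in data]
```

Each starting point climbs towards a local maximum of the estimated density, and points that end up at the same mode are treated as one cluster.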

equation, graph, mean shift, (13 more...)

arXiv.org Machine Learning

2108.08687

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)

Genre:

Research Report (0.50)
Instructional Material (0.45)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
