AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

A Predictive Model using Machine Learning Algorithm in Identifying Students Probability on Passing Semestral Course

Doctor, Anabella C.

arXiv.org Artificial Intelligence, Apr-11-2023

This study aims to determine a predictive model of students' probability of passing their courses at the earliest stage of the semester. To discover a good predictive model with high acceptability, accuracy, and precision, one that delivers a useful outcome for decision making in education systems, improves the processes of conveying knowledge, and uplifts students' academic performance, the proponent applied and strictly followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. This study employs classification as the data mining technique and a decision tree as the algorithm. With the newly discovered predictive model, the prediction of students' probabilities of passing their current courses gives 0.7619 accuracy, 0.8333 precision, 0.8823 recall, and 0.8571 F1 score, which shows that the model used in the prediction is reliable, accurate, and recommendable. Considering the indicators and the results, it can be noted that the prediction model used in this study is highly acceptable. Data mining techniques provide effective and efficient innovative tools for analyzing and predicting student performance. The model used in this study will greatly affect the way educators understand and identify the weaknesses of their students in class and the way they improve the effectiveness of learning processes geared to their students, bring down academic failure rates, and help institution administrators modify their learning system outcomes. Further study is highly needed on the inclusion of students' demographic information, a larger dataset, and automated and manual processing of predictive criteria indicators, so that students can see, as early as the midterm period, which criteria they must improve to pass their courses at the end of the semester.
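
The four reported scores are all functions of one confusion matrix, so they can be cross-checked against each other. The sketch below is illustrative only: the counts (15 true positives, 3 false positives, 2 false negatives, 1 true negative) are hypothetical values chosen because they reproduce the reported figures, not numbers taken from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # share of predicted passes that are real passes
    recall = tp / (tp + fn)             # share of real passes that were predicted
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts (an assumption, not data from the paper).
metrics = classification_metrics(tp=15, fp=3, fn=2, tn=1)
print([round(m, 4) for m in metrics])
```

With these assumed counts the recall is 15/17, which rounds to 0.8824; the abstract's 0.8823 matches the same value truncated to four decimals.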

artificial intelligence, data mining, machine learning, (17 more...)

doi: 10.25147/ijcsr.2017.001.1.135

2304.05565

Country:

Asia > Philippines > Luzon > Calabarzon > Province of Cavite (0.14)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Materials > Metals & Mining (1.00)
Education > Educational Setting (1.00)
Education > Assessment & Standards > Student Performance (1.00)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Robust k-means: a Theoretical Revisit

Neural Information Processing Systems, Apr-10-2023, 09:27:02 GMT

Over the last years, many variations of the quadratic k-means clustering procedure have been proposed, all aiming to robustify the performance of the algorithm in the presence of outliers. In general terms, two main approaches have been developed: one based on penalized regularization methods, and one based on trimming functions. In this work, we present a theoretical analysis of the robustness and consistency properties of a variant of the classical quadratic k-means algorithm, the robust k-means, which borrows ideas from outlier detection in regression. We show that two outliers in a dataset are enough to break down this clustering procedure. However, if we focus on "well-structured" datasets, then robust k-means can recover the underlying cluster structure in spite of the outliers. Finally, we show that, with slight modifications, the most general non-asymptotic results for consistency of quadratic k-means remain valid for this robust variant.
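
To make the trimming idea concrete, here is a minimal sketch of a trimmed k-means loop: at each iteration the `n_trim` points farthest from their nearest center are left out of the center update. This illustrates the general trimming family mentioned above, not the robust k-means variant analyzed in the paper; the function name, the naive first-k initialization, and the toy data are all assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def trimmed_kmeans(points, k, n_trim, n_iter=10):
    """Lloyd-style iterations that ignore the n_trim farthest points per update."""
    centers = list(points[:k])  # naive initialization (assumption for the sketch)
    nearest, order = [], []
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        nearest = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        dists = [sq_dist(p, centers[c]) for p, c in zip(points, nearest)]
        # Keep all but the n_trim points with the largest distance.
        order = sorted(range(len(points)), key=lambda i: dists[i])
        kept = order[:len(points) - n_trim]
        # Recompute each center from its kept members only.
        for j in range(k):
            members = [points[i] for i in kept if nearest[i] == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    trimmed = set(order[len(points) - n_trim:])
    return centers, nearest, trimmed


# Toy data: two tight groups plus one gross outlier at index 6.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10), (1000, 1000)]
centers, labels, trimmed = trimmed_kmeans(points, k=2, n_trim=1)
print(centers, trimmed)
```

On this toy data the outlier is excluded from every update, so the centers stay near the two groups. If an outlier happens to be chosen as an initial center it has distance zero and is never trimmed, which is one simple way such procedures can be misled.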

dataset, outlier, rkm, (15 more...)

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Greece (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

DASS Good: Explainable Data Mining of Spatial Cohort Data

Wentzel, Andrew, Floricel, Carla, Canahuate, Guadalupe, Naser, Mohamed A., Mohamed, Abdallah S., Fuller, Clifton David, van Dijk, Lisanne, Marai, G. Elisabeta

arXiv.org Artificial Intelligence, Apr-10-2023

Developing applicable clinical machine learning models is a difficult task when the data includes spatial information, for example, radiation dose distributions across adjacent organs at risk. We describe the co-design of a modeling system, DASS, to support the hybrid human-machine development and validation of predictive models for estimating long-term toxicities related to radiotherapy doses in head and neck cancer patients. Developed in collaboration with domain experts in oncology and data mining, DASS incorporates human-in-the-loop visual steering, spatial data, and explainable AI to augment domain knowledge with automatic data mining. We demonstrate DASS with the development of two practical clinical stratification models and report feedback from domain experts. Finally, we describe the design lessons learned from this collaborative experience.

data mining, machine learning, organ, (20 more...)

2304.0487

Country:

North America > United States > Texas (0.04)
North America > United States > Iowa (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry: Health & Medicine > Therapeutic Area > Oncology > Head & Neck Cancer (0.55)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Image deduplication using OpenAI's CLIP and Community Detection

#artificialintelligence, Apr-9-2023, 16:55:42 GMT

A short guide on how to use image embeddings from OpenAI's CLIP and clustering techniques to group near-duplicate images together. CLIP is trained by trying to align image-text embedding pairs, or "learning visual representations from natural language supervision". You can use its text or image embeddings to accomplish a lot of different tasks, such as zero-shot image classification! Its embeddings are pretty powerful. For this task, we're going to use the AirBnB Duplicate Image Dataset, available on Kaggle.
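
The grouping step can be sketched without any model at all. The example below assumes the embeddings have already been computed and uses short toy vectors in their place; it links any two images whose cosine similarity reaches a threshold and returns the connected components of that graph. Connected components are used here as a simple stand-in for community detection, and the function name and the 0.95 threshold are assumptions, not values from the guide.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_near_duplicates(embeddings, threshold=0.95):
    """Group indices whose embeddings are linked by similarity >= threshold."""
    n = len(embeddings)
    parent = list(range(n))  # union-find forest over image indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy stand-ins for image embeddings (real CLIP vectors are much longer).
embeddings = [
    (1.0, 0.0, 0.0),
    (0.99, 0.05, 0.0),  # near-duplicate of image 0
    (0.0, 1.0, 0.0),
    (0.0, 0.98, 0.1),   # near-duplicate of image 2
    (0.0, 0.0, 1.0),
]
groups = group_near_duplicates(embeddings)
print(groups)
```

The pairwise loop is quadratic in the number of images, which is fine for a sketch but would need an approximate nearest-neighbor index on a large collection.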

clip and community detection, image deduplication, openai, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.77)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.63)

Restructuring Graph for Higher Homophily via Adaptive Spectral Clustering

Li, Shouheng, Kim, Dongwoo, Wang, Qing

arXiv.org Artificial Intelligence, Apr-9-2023

While a growing body of literature has been studying new Graph Neural Networks (GNNs) that work on both homophilic and heterophilic graphs, little has been done on adapting classical GNNs to less-homophilic graphs. Although the ability to handle less-homophilic graphs is restricted, classical GNNs still stand out in several nice properties such as efficiency, simplicity, and explainability. In this work, we propose a novel graph restructuring method that can be integrated into any type of GNN, including classical GNNs, to leverage the benefits of existing GNNs while alleviating their limitations. Our contribution is threefold: a) learning the weight of pseudo-eigenvectors for an adaptive spectral clustering that aligns well with known node labels, b) proposing a new density-aware homophilic metric that is robust to label imbalance, and c) reconstructing the adjacency matrix based on the result of adaptive spectral clustering to maximize the homophilic scores. The experimental results show that our graph restructuring method can significantly boost the performance of six classical GNNs by an average of 25% on less-homophilic graphs. The boosted performance is comparable to state-of-the-art methods.
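
For readers unfamiliar with homophily scores, the sketch below computes the plain edge homophily ratio, the fraction of edges whose endpoints share a label. This is a commonly used baseline and not the density-aware metric the paper proposes; the toy graph and the helper name are assumptions.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Toy 4-node cycle with two classes (hypothetical data).
labels = [0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
before = edge_homophily(edges, labels)

# Restructuring that keeps only same-label edges raises the ratio.
restructured = [(u, v) for u, v in edges if labels[u] == labels[v]]
after = edge_homophily(restructured, labels)
print(before, after)
```

The filter above uses all labels directly, which a real method cannot do at test time; it is only meant to show how rewiring changes the score.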

artificial intelligence, data mining, machine learning, (16 more...)

2206.02386

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.04)
Asia > Middle East > Jordan (0.04)
Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Optimal Sparse Regression Trees

Zhang, Rui, Xin, Rui, Seltzer, Margo, Rudin, Cynthia

arXiv.org Artificial Intelligence, Apr-9-2023

Regression trees are one of the oldest forms of AI models, and their predictions can be made without a calculator, which makes them broadly useful, particularly for high-stakes applications. Within the large literature on regression trees, there has been little effort towards full provable optimization, mainly due to the computational hardness of the problem. This work proposes a dynamic-programming-with-bounds approach to the construction of provably-optimal sparse regression trees. We leverage a novel lower bound based on an optimal solution to the k-Means clustering algorithm in 1-dimension over the set of labels. We are often able to find optimal sparse trees in seconds, even for challenging datasets that involve large numbers of samples and highly-correlated features.
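
One-dimensional k-means is a case where an exact optimum is cheap to compute, because optimal clusters are contiguous runs of the sorted values. The sketch below is a plain O(k n^2) dynamic program over sorted values with prefix sums; it illustrates that subproblem only, and how the paper turns it into a lower bound for tree search is not reproduced here. The function name is hypothetical.

```python
def kmeans_1d_optimal_cost(values, k):
    """Minimum within-cluster sum of squares for k clusters of 1-D values (k <= n)."""
    xs = sorted(values)
    n = len(xs)
    # Prefix sums of x and x^2 give any segment's cost in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        """Sum of squared deviations from the mean for xs[i:j], with j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[c][j]: minimum cost of splitting the first j values into c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            best[c][j] = min(best[c - 1][i] + sse(i, j) for i in range(c - 1, j))
    return best[k][n]


labels = [1, 2, 3, 10, 11, 12]
cost = kmeans_1d_optimal_cost(labels, k=2)
print(cost)
```

For the toy labels above, the optimal split is {1, 2, 3} and {10, 11, 12}, each contributing a squared error of 2.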

artificial intelligence, machine learning, optimization problem, (19 more...)

2211.1498

Country:

Asia > South Korea > Seoul > Seoul (0.05)
North America > United States > Massachusetts (0.04)
North America > Canada > British Columbia (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos

Lu, Shiyang, Deng, Yunfu, Boularias, Abdeslam, Bekris, Kostas

arXiv.org Artificial Intelligence, Apr-9-2023

This work proposes a self-supervised learning system for segmenting rigid objects in RGB images. The proposed pipeline is trained on unlabeled RGB-D videos of static objects, which can be captured with a camera carried by a mobile robot. A key feature of the self-supervised training process is a graph-matching algorithm that operates on the over-segmentation output of the point cloud that is reconstructed from each video. The graph matching, along with point cloud registration, is able to find reoccurring object patterns across videos and combine them into 3D object pseudo labels, even under occlusions or different viewing angles. Projected 2D object masks from 3D pseudo labels are used to train a pixel-wise feature extractor through contrastive learning. During online inference, a clustering method uses the learned features to cluster foreground pixels into object segments. Experiments highlight the method's effectiveness on both real and synthetic video datasets, which include cluttered scenes of tabletop objects. The proposed method outperforms existing unsupervised methods for object segmentation by a large margin.

artificial intelligence, machine learning, segmentation, (18 more...)

2304.04325

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New Jersey > Middlesex County > New Brunswick (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Semantic-Enhanced Image Clustering

Cai, Shaotian, Qiu, Liping, Chen, Xiaojun, Zhang, Qin, Chen, Longteng

arXiv.org Artificial Intelligence, Apr-8-2023

Image clustering is an important and open-challenging task in computer vision. Although many methods have been proposed to solve the image clustering task, they only explore images and uncover clusters according to the image features, thus being unable to distinguish visually similar but semantically different images. In this paper, we propose to investigate the task of image clustering with the help of a visual-language pre-training model. Different from the zero-shot setting, in which the class names are known, we only know the number of clusters in this setting. Therefore, how to map images to a proper semantic space and how to cluster images from both image and semantic spaces are two key problems. To solve the above problems, we propose a novel image clustering method guided by the visual-language pre-training model CLIP, named Semantic-Enhanced Image Clustering (SIC). In this new method, we propose a method to map the given images to a proper semantic space first and efficient methods to generate pseudo-labels according to the relationships between images and semantics. Finally, we propose performing clustering with consistency learning in both image space and semantic space, in a self-supervised learning fashion. The theoretical result of convergence analysis shows that our proposed method can converge at a sublinear speed. Theoretical analysis of expectation risk also shows that we can reduce the expected risk by improving neighborhood consistency, increasing prediction confidence, or reducing neighborhood imbalance. Experimental results on five benchmark datasets clearly show the superiority of our new method.

artificial intelligence, machine learning, natural language, (17 more...)

2208.09849

Country: Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

DiscoVars: A New Data Analysis Perspective -- Application in Variable Selection for Clustering

Demiriz, Ayhan

arXiv.org Artificial Intelligence, Apr-8-2023

We present a new data analysis perspective to determine variable importance regardless of the underlying learning task. Traditionally, variable selection is considered an important step in supervised learning for both classification and regression problems. Variable selection also becomes critical when the costs associated with data collection and storage are considerably high, as in remote sensing. Therefore, we propose a new methodology to select important variables from the data by first creating dependency networks among all variables and then ranking them (i.e., nodes) by graph centrality measures. Selecting the top-n variables according to a preferred centrality measure will yield a strong candidate subset of variables for further learning tasks, e.g., clustering. We present our tool as a Shiny app, which is a user-friendly interface development environment. We also extend the user interface for two well-known unsupervised variable selection methods from the literature for comparison.
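
The network-then-rank idea can be sketched in a few lines. The example below is a simplified stand-in under stated assumptions: it links two variables when the absolute Pearson correlation reaches a threshold (the paper's dependency networks are built differently), ranks nodes by degree centrality, and breaks ties alphabetically. The function names, the 0.6 threshold, and the toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rank_variables(columns, threshold=0.6, top_n=2):
    """Return the top_n variable names by degree in a correlation-threshold graph."""
    names = list(columns)
    degree = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                degree[a] += 1  # an undirected edge a-b adds one to each endpoint
                degree[b] += 1
    ranked = sorted(names, key=lambda name: (-degree[name], name))
    return ranked[:top_n], degree


# Toy data: "a" is the sum of "b" and "c"; "d" is uncorrelated with the rest.
columns = {
    "a": [0, -2, -2, 0, 4],
    "b": [-2, -1, 0, 1, 2],
    "c": [2, -1, -2, -1, 2],
    "d": [-1, 2, 0, -2, 1],
}
top, degree = rank_variables(columns)
print(top, degree)
```

Here "a" is linked to both "b" and "c" and so ranks first, while "d" has no links; other centrality measures (betweenness, eigenvector) would slot into the same ranking step.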

artificial intelligence, machine learning, selection, (18 more...)

2304.03983

Country:

North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine (1.00)
Banking & Finance > Trading (0.30)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)

FedPNN: One-shot Federated Classification via Evolving Clustering Method and Probabilistic Neural Network hybrid

Prasad, Polaki Durga, Vivek, Yelleti, Ravi, Vadlamani

arXiv.org Artificial Intelligence, Apr-8-2023

Protecting data privacy is paramount in fields such as finance, banking, and healthcare. Federated Learning (FL) has attracted widespread attention due to its decentralized, distributed training and its ability to protect privacy while obtaining a global shared model. However, FL presents challenges such as communication overhead and limited resource capability. This motivated us to propose a two-stage federated learning approach toward the objective of privacy protection, which is a first-of-its-kind study, as follows: (i) in the first stage, the synthetic dataset is generated by employing two different distributions as noise to the vanilla conditional tabular generative adversarial neural network (CTGAN), resulting in a modified CTGAN, and (ii) in the second stage, the Federated Probabilistic Neural Network (FedPNN) is developed and employed for building a globally shared classification model. We also employed synthetic dataset metrics to check the quality of the generated synthetic dataset. Further, we proposed a meta-clustering algorithm whereby the cluster centers obtained from the clients are clustered at the server for training the global model. Despite PNN being a one-pass learning classifier, its complexity depends on the training data size. Therefore, we employed a modified evolving clustering method (ECM), another one-pass algorithm, to cluster the training data, thereby increasing the speed further. Moreover, we conducted sensitivity analysis by varying Dthr, a hyperparameter of ECM, at the server and client, one at a time. The effectiveness of our approach is validated on four finance and medical datasets.

artificial intelligence, dataset, machine learning, (14 more...)

2304.04147

Country:

Asia > India (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)