AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

A Semi-supervised Approach for Activity Recognition from Indoor Trajectory Data

Rana, Mashud, Rahman, Ashfaqur, Smith, Daniel

arXiv.org Artificial Intelligence, Jan-10-2023

The increasingly wide usage of location-aware sensors has made it possible to collect large volumes of trajectory data in diverse application domains. Machine learning makes it possible to study the activities or behaviours of moving objects (e.g., people, vehicles, robots) using such trajectory data with rich spatiotemporal information to facilitate informed strategic and operational decision making. In this study, we consider the task of classifying the activities of moving objects from their noisy indoor trajectory data in a collaborative manufacturing environment. Activity recognition can help manufacturing companies develop appropriate management policies and optimise safety, productivity, and efficiency. We present a semi-supervised machine learning approach that first applies an information-theoretic criterion to partition a long trajectory into a set of segments such that the object exhibits homogeneous behaviour within each segment. The segments are then labelled automatically based on a constrained hierarchical clustering method. Finally, a deep learning classification model based on convolutional neural networks is trained on trajectory segments and the generated pseudo labels. The proposed approach has been evaluated on a dataset containing indoor trajectories of multiple workers collected from a tricycle assembly workshop. The proposed approach is shown to achieve high classification accuracy (F-score ranges from 0.81 to 0.95 for different trajectories) using only a small proportion of labelled trajectory segments.

artificial intelligence, machine learning, trajectory, (19 more...)

2301.03134

Country:

Oceania > Australia (0.04)
Asia > China (0.04)
North America > United States (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Information Technology (0.93)
Automobiles & Trucks (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A review of clustering models in educational data science towards fairness-aware learning

Quy, Tai Le, Friege, Gunnar, Ntoutsi, Eirini

arXiv.org Artificial Intelligence, Jan-9-2023

Ensuring fairness is essential for every education system. Machine learning is increasingly supporting the education system and the educational data science (EDS) domain, from decision support to educational activities and learning analytics. However, machine learning-based decisions can be biased because the algorithms may generate results based on students' protected attributes such as race or gender. Clustering is an important machine learning technique for exploring student data in order to support the decision-maker, as well as to support educational activities such as group assignments. Therefore, ensuring high-quality clustering models and satisfying fairness constraints are both important requirements. This chapter comprehensively surveys clustering models and their fairness in EDS. We especially focus on investigating the fair clustering models applied in educational activities. It is believed that these models are practical tools for analyzing students' data and ensuring fairness in EDS.

artificial intelligence, machine learning, student, (15 more...)

doi: 10.1007/978-981-99-0026-8_2

2301.03421

Country:

Asia > Middle East > Jordan (0.04)
Europe > Germany > Lower Saxony > Hanover (0.04)
Asia > Indonesia (0.04)
(13 more...)

Genre:

Research Report (1.00)
Instructional Material > Online (0.93)
Instructional Material > Course Syllabus & Notes (0.67)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Educational Setting > Online (1.00)
Education > Educational Setting > Higher Education (1.00)
(3 more...)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(2 more...)

Stars: Tera-Scale Graph Building for Clustering and Graph Learning

Carey, CJ, Halcrow, Jonathan, Jayaram, Rajesh, Mirrokni, Vahab, Schudy, Warren, Zhong, Peilin

arXiv.org Artificial Intelligence, Jan-9-2023

A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role in many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present Stars: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning-based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars for multiple data sets, allowing for graph building at the Tera-Scale, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10- to 1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2- to 10-fold improvement in running time without quality loss.

artificial intelligence, graph, machine learning, (18 more...)

2212.02635

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Privacy-Preserving Record Linkage for Cardinality Counting

Wu, Nan, Vatsalan, Dinusha, Kaafar, Mohamed Ali, Ramesh, Sanath Kumar

arXiv.org Artificial Intelligence, Jan-9-2023

Several applications require counting the number of distinct items in the data, which is known as the cardinality counting problem. Example applications include health applications such as counting rare-disease patients for adequate awareness and funding, and counting the number of cases of a new disease for outbreak detection; marketing applications such as counting the visibility reached for a new product; and cybersecurity applications such as tracking the number of unique views of social media posts. The data needed for the counting is, however, often personal and sensitive, and needs to be processed using privacy-preserving techniques. The quality of data in different databases, for example typos, errors and variations, poses additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention in recent times and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far been done on privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm using unsupervised clustering techniques to link and count the cardinality of individuals in multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods to find the optimal number of clusters as the cardinality are far from accurate as they do not take into account the purity and completeness of generated clusters. We propose a novel method to find the optimal number of clusters in unsupervised learning. Our experimental results on real and synthetic datasets are highly promising, with a significantly smaller error rate of less than 0.1 at a privacy budget ε = 1.0 compared to the state-of-the-art fuzzy matching and clustering method.

bloom filter, data mining, machine learning, (16 more...)

2301.04

Country:

North America > United States > North Carolina (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Fair Clustering Under a Bounded Cost

Esmaeili, Seyed A., Brubach, Brian, Srinivasan, Aravind, Dickerson, John P.

arXiv.org Artificial Intelligence, Jan-8-2023

Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the "price of fairness," can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.
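The group-fairness requirement in this abstract (each color roughly equally represented in each cluster) can be made concrete with a small measurement helper. The following is a minimal sketch, not the paper's objectives or algorithms: the function names are hypothetical, and "representation" is assumed here to mean a color's share of a cluster compared with its share of the whole dataset.

```python
from collections import Counter


def color_proportions(labels, colors):
    """Return {cluster: {color: share of that cluster's points}}."""
    per_cluster = {}
    for lab, col in zip(labels, colors):
        per_cluster.setdefault(lab, Counter())[col] += 1
    return {
        lab: {c: n / sum(cnt.values()) for c, n in cnt.items()}
        for lab, cnt in per_cluster.items()
    }


def max_representation_gap(labels, colors):
    """Largest absolute gap between a color's share in any cluster
    and its share in the whole dataset (0.0 means perfectly balanced)."""
    overall = Counter(colors)
    total = len(colors)
    props = color_proportions(labels, colors)
    return max(
        abs(props[lab].get(c, 0.0) - overall[c] / total)
        for lab in props
        for c in overall
    )
```

A clustering that separates the colors completely scores a large gap, while one that mixes them in dataset proportions scores zero; a fair-clustering method would trade this gap against the clustering cost.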

artificial intelligence, machine learning, objective, (18 more...)

2106.07239

Country:

North America > United States > Maryland (0.04)
North America > United States > Texas (0.04)
North America > United States > California > Orange County > Irvine (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry:

Government > Regional Government > North America Government > United States Government (0.92)
Health & Medicine (0.67)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Community detection in multiplex networks based on orthogonal nonnegative matrix tri-factorization

Ortiz-Bouza, Meiby, Aviyente, Selin

arXiv.org Artificial Intelligence, Jan-8-2023

Networks are commonly used to model complex systems. The different entities in the system are represented by nodes of the network and their interactions by edges. In most real-life systems, the different entities may interact in different ways, necessitating the use of multiplex networks where multiple links are used to model the interactions. One of the major tools for inferring network topology is community detection. Although there are numerous works on community detection in single-layer networks, existing community detection methods for multiplex networks mostly learn a common community structure across layers and do not take the heterogeneity across layers into account. In this paper, we introduce a new multiplex community detection method that identifies communities that are common across layers as well as those that are unique to each layer. The proposed method, Multiplex Orthogonal Nonnegative Matrix Tri-Factorization, represents the adjacency matrix of each layer as the sum of two low-rank matrix factorizations corresponding to the common and private communities, respectively. Unlike most of the existing methods, which require the number of communities to be pre-determined, the proposed method also introduces a two-stage method to determine the number of common and private communities. The proposed algorithm is evaluated on synthetic and real multiplex networks, as well as for multiview clustering applications, and compared to state-of-the-art techniques.

community structure, data mining, machine learning, (19 more...)

2205.00626

Country:

North America > United States > Montana (0.04)
North America > United States > Michigan > Ingham County > Lansing (0.04)
North America > United States > Michigan > Ingham County > East Lansing (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.70)

Industry: Law (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

k-Means SubClustering: A Differentially Private Algorithm with Improved Clustering Quality

Joshi, Devvrat, Thakkar, Janvi

arXiv.org Artificial Intelligence, Jan-7-2023

In today's data-driven world, the sensitivity of information has been a significant concern. With this data and additional information on the person's background, one can easily infer an individual's private data. Many differentially private iterative algorithms have been proposed in interactive settings to protect an individual's privacy from these inference attacks. Existing approaches compute differentially private (DP) centroids by running the iterative Lloyd's algorithm and perturbing the centroids with various DP mechanisms. These DP mechanisms do not guarantee convergence of differentially private iterative algorithms and degrade the quality of the clusters. Thus, in this work, we further extend the previous work on 'Differentially Private k-Means Clustering With Convergence Guarantee' by taking it as our baseline. The novelty of our approach is to sub-cluster the clusters and then select the centroid which has a higher probability of moving in the direction of the future centroid. At every Lloyd's step, noise is injected into the centroids using the exponential DP mechanism. The results of the experiments indicate that our approach outperforms the current state-of-the-art method, i.e., the baseline algorithm, in terms of clustering quality while maintaining the same differential privacy requirements. The clustering quality improved by factors of 4.13 and 2.83 over the baseline for the Wine and Breast_Cancer datasets, respectively.
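The general idea of perturbing centroids inside Lloyd's iteration can be illustrated in a few lines. The sketch below is not the authors' algorithm: it swaps the exponential mechanism mentioned in the abstract for Laplace noise on per-cluster sums and counts, omits sub-clustering entirely, and assumes every coordinate has been scaled to [0, 1]. The function names, the per-step budget `eps`, and the sensitivity value d + 1 are assumptions for illustration, and the code has not been audited as a privacy mechanism.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_lloyd_step(points, centroids, eps, rng):
    """One Lloyd iteration with Laplace-perturbed cluster sums and counts.

    Assumes points lie in [0, 1]^d. `eps` is the budget spent by this
    single step; repeated steps consume budget cumulatively.
    """
    d = len(points[0])
    k = len(centroids)
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    # Assignment step: each point goes to its nearest current centroid.
    for p in points:
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[j] += 1
        for i in range(d):
            sums[j][i] += p[i]
    # Assumed L1 sensitivity: adding or removing one point changes one
    # cluster's sum by at most d and its count by 1.
    scale = (d + 1) / eps
    new_centroids = []
    for j in range(k):
        noisy_count = max(1.0, counts[j] + laplace(scale, rng))
        # Update step on noisy statistics, clipped back into [0, 1].
        new_centroids.append([
            min(1.0, max(0.0, (sums[j][i] + laplace(scale, rng)) / noisy_count))
            for i in range(d)
        ])
    return new_centroids
```

With a very large `eps` the noise vanishes and the step reduces to ordinary Lloyd's update; with a small `eps` the centroids become visibly noisy, which is the quality degradation the abstract refers to.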

algorithm, artificial intelligence, machine learning, (17 more...)

2301.02896

Country:

Asia > India > Gujarat > Gandhinagar (0.04)
South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > Texas (0.04)
North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.72)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Randomized Greedy Algorithms and Composable Coreset for k-Center Clustering with Outliers

Ding, Hu, Huang, Ruomin, Liu, Kai, Yu, Haikuo, Wang, Zixiu

arXiv.org Artificial Intelligence, Jan-7-2023

In this paper, we study the problem of k-center clustering with outliers. The problem has many important applications in the real world, but the presence of outliers can significantly increase the computational complexity. Though a number of methods have been developed in the past decades, it is still quite challenging to design a quality-guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, that was developed for solving the ordinary k-center clustering problem. Based on some novel observations, we show that a simple randomized version of this greedy strategy can actually handle outliers efficiently. We further show that this randomized greedy approach also yields a small coreset for the problem in doubling metrics (even if the doubling dimension is not given), which can greatly reduce the computational complexity. Moreover, together with the partial clustering framework proposed by Guha et al. (2019), we prove that our coreset method can be applied to distributed data with a low communication complexity. The experimental results suggest that our algorithms can achieve near-optimal solutions and yield lower complexities compared with the existing methods.
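For readers unfamiliar with the greedy method named in the abstract, the following is a minimal sketch of Gonzalez's farthest-first traversal for the ordinary k-center problem (no outliers). It is not the paper's randomized variant or its coreset construction; the function name, the `first` parameter, and the use of Euclidean distance are illustrative assumptions.

```python
import math


def gonzalez_k_center(points, k, first=0):
    """Farthest-first traversal for ordinary k-center.

    Returns (indices of chosen centers, covering radius), where the
    radius is the largest distance from any point to its nearest center.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [first]
    # dist[i] holds the distance from point i to its nearest chosen center.
    dist = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # Greedy choice: the point currently farthest from all centers.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers, max(dist)
```

Because each step picks the single farthest point, an outlier is always selected early as a center, which is the weakness that outlier-aware variants such as the one in this paper are designed to address.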

artificial intelligence, data mining, machine learning, (17 more...)

2301.02814

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.05)
Asia > China > Anhui Province (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(8 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Hierarchical Clustering: A Practical Introduction of Agglomerative and Divisive Methods

#artificialintelligence, Jan-6-2023, 05:30:49 GMT

In this article, we discuss hierarchical clustering in detail: why we need it, how it works, what types exist, and which datasets it applies to. Before moving on to hierarchical clustering, we should ask why it is worth discussing when we already have K-Means clustering. If you have studied K-Means, you know that the algorithm forms clusters based on each point's distance to a centroid. It works well on datasets with well-defined boundaries and few outliers. In the picture above, K-Means works well, but on more complex datasets problems arise and K-Means does not work properly. As the picture below shows, K-Means fails to form the clusters.
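As a contrast with centroid-based K-Means, the agglomerative (bottom-up) approach named in the title can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the article's own code: it uses single linkage with Euclidean distance, the function name is hypothetical, and the naive pairwise search is only suitable for small inputs.

```python
import math


def single_linkage(points, n_clusters):
    """Agglomerative clustering: start with one cluster per point and
    repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index a, index b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # b > a, so removing cluster b does not shift the position of a.
        clusters[a].extend(clusters.pop(b))
    return [sorted(c) for c in clusters]
```

Unlike K-Means, this procedure never computes a centroid, so it can follow elongated or irregular groups; stopping the merges at different values of `n_clusters` corresponds to cutting the hierarchy at different levels.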

artificial intelligence, machine learning, matrix, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Time-inhomogeneous diffusion geometry and topology

Huguet, Guillaume, Tong, Alexander, Rieck, Bastian, Huang, Jessie, Kuchroo, Manik, Hirn, Matthew, Wolf, Guy, Krishnaswamy, Smita

arXiv.org Artificial Intelligence, Jan-5-2023

Diffusion condensation is a dynamic process that yields a sequence of multiscale data representations that aim to encode meaningful abstractions. It has proven effective for manifold learning, denoising, clustering, and visualization of high-dimensional data. Diffusion condensation is constructed as a time-inhomogeneous process where each step first computes and then applies a diffusion operator to the data. We theoretically analyze the convergence and evolution of this process from geometric, spectral, and topological perspectives. From a geometric perspective, we obtain convergence bounds based on the smallest transition probability and the radius of the data, whereas from a spectral perspective, our bounds are based on the eigenspectrum of the diffusion kernel. Our spectral results are of particular interest since most of the literature on data diffusion is focused on homogeneous processes. From a topological perspective, we show diffusion condensation generalizes centroid-based hierarchical clustering. We use this perspective to obtain a bound based on the number of data points, independent of their location. To understand the evolution of the data geometry beyond convergence, we use topological data analysis. We show that the condensation process itself defines an intrinsic condensation homology. We use this intrinsic topology as well as the ambient persistent homology of the condensation process to study how the data changes over diffusion time. We demonstrate both types of topological information in well-understood toy examples. Our work gives theoretical insights into the convergence of diffusion condensation, and shows that it provides a link between topological and geometric data analysis.

artificial intelligence, data mining, machine learning, (15 more...)

2203.1486

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(5 more...)

Genre:

Overview (0.67)
Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
