AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Dual Node and Edge Fairness-Aware Graph Partition

Liu, Tingwei, Li, Peizhao, Liu, Hongfu

arXiv.org Artificial IntelligenceJul-14-2023

Fair graph partition of social networks is a crucial step toward ensuring fair and non-discriminatory treatments in unsupervised user analysis. Current fair partition methods typically consider node balance, a notion pursuing a proportionally balanced number of nodes from all demographic groups, but ignore the bias induced by imbalanced edges in each cluster. To address this gap, we propose a notion edge balance to measure the proportion of edges connecting different demographic groups in clusters. We analyze the relations between node balance and edge balance, then with line graph transformations, we propose a co-embedding framework to learn dual node and edge fairness-aware representations for graph partition. We validate our framework through several social network datasets and observe balanced partition in terms of both nodes and edges along with good utility. Moreover, we demonstrate our fair partition can be used as pseudo labels to facilitate graph neural networks to behave fairly in node classification and link prediction tasks.

artificial intelligence, data mining, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2306.10123

Country:

Europe > Portugal > Braga > Braga (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.70)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

Dink-Net: Neural Clustering on Large Graphs

Liu, Yue, Liang, Ke, Xia, Jun, Zhou, Sihang, Yang, Xihong, Liu, Xinwang, Li, Stan Z.

arXiv.org Artificial IntelligenceJul-14-2023

Deep graph clustering, which aims to group the nodes of a graph into disjoint clusters with deep neural networks, has achieved promising progress in recent years. However, the existing methods fail to scale to the large graph with million nodes. To solve this problem, a scalable deep graph clustering method (Dink-Net) is proposed with the idea of dilation and shrink. Firstly, by discriminating nodes, whether being corrupted by augmentations, representations are learned in a self-supervised manner. Meanwhile, the cluster centres are initialized as learnable neural parameters. Subsequently, the clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss in an adversarial manner. By these settings, we unify the two-step clustering, i.e., representation learning and clustering optimization, into an end-to-end framework, guiding the network to learn clustering-friendly features. Besides, Dink-Net scales well to large graphs since the designed loss functions adopt the mini-batch data to optimize the clustering distribution even without performance drops. Both experimental results and theoretical analyses demonstrate the superiority of our method. Compared to the runner-up, Dink-Net achieves 9.62% NMI improvement on the ogbn-papers100M dataset with 111 million nodes and 1.6 billion edges. The source code is released at https://github.com/yueliu1999/Dink-Net. Besides, a collection (papers, codes, and datasets) of deep graph clustering is shared at https://github.com/yueliu1999/Awesome-Deep-Graph-Clustering.

artificial intelligence, graph, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2305.18405

Country:

Asia > China (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Global $k$-means$++$: an effective relaxation of the global $k$-means clustering algorithm

Vardakas, Georgios, Likas, Aristidis

arXiv.org Artificial IntelligenceJul-14-2023

The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. The global $k$-means is a deterministic algorithm proposed to tackle the random initialization problem of k-means but its well-known that requires high computational cost. It partitions the data to $K$ clusters by solving all $k$-means sub-problems incrementally for all $k=1,\ldots, K$. For each $k$ cluster problem, the method executes the $k$-means algorithm $N$ times, where $N$ is the number of datapoints. In this paper, we propose the \emph{global $k$-means\texttt{++}} clustering algorithm, which is an effective way of acquiring quality clustering solutions akin to those of global $k$-means with a reduced computational load. This is achieved by exploiting the center selection probability that is effectively used in the $k$-means\texttt{++} algorithm. The proposed method has been tested and compared in various benchmark datasets yielding very satisfactory results in terms of clustering quality and execution speed.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2211.12271

Country:

Europe > Greece > Epirus > Ioannina (0.05)
North America > United States > California > Alameda County > Oakland (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Breaking 3-Factor Approximation for Correlation Clustering in Polylogarithmic Rounds

Cao, Nairen, Huang, Shang-En, Su, Hsin-Hao

arXiv.org Artificial IntelligenceJul-13-2023

In this paper, we study parallel algorithms for the correlation clustering problem, where every pair of two different entities is labeled with similar or dissimilar. The goal is to partition the entities into clusters to minimize the number of disagreements with the labels. Currently, all efficient parallel algorithms have an approximation ratio of at least 3. In comparison with the $1.994+\epsilon$ ratio achieved by polynomial-time sequential algorithms [CLN22], a significant gap exists. We propose the first poly-logarithmic depth parallel algorithm that achieves a better approximation ratio than 3. Specifically, our algorithm computes a $(2.4+\epsilon)$-approximate solution and uses $\tilde{O}(m^{1.5})$ work. Additionally, it can be translated into a $\tilde{O}(m^{1.5})$-time sequential algorithm and a poly-logarithmic rounds sublinear-memory MPC algorithm with $\tilde{O}(m^{1.5})$ total memory. Our approach is inspired by Awerbuch, Khandekar, and Rao's [AKR12] length-constrained multi-commodity flow algorithm, where we develop an efficient parallel algorithm to solve a truncated correlation clustering linear program of Charikar, Guruswami, and Wirth [CGW05]. Then we show the solution of the truncated linear program can be rounded with a factor of at most 2.4 loss by using the framework of [CMSY15]. Such a rounding framework can then be implemented using parallel pivot-based approaches.

artificial intelligence, machine learning, triangle, (18 more...)

arXiv.org Artificial Intelligence

2307.06723

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.24)
North America > United States > Colorado (0.14)
Europe > Denmark (0.14)

Genre: Research Report (0.50)

Industry: Energy > Oil & Gas (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Student Assessment in Cybersecurity Training Automated by Pattern Mining and Clustering

Švábenský, Valdemar, Vykopal, Jan, Čeleda, Pavel, Tkáčik, Kristián, Popovič, Daniel

arXiv.org Artificial IntelligenceJul-13-2023

Hands-on cybersecurity training allows students and professionals to practice various tools and improve their technical skills. The training occurs in an interactive learning environment that enables completing sophisticated tasks in full-fledged operating systems, networks, and applications. During the training, the learning environment allows collecting data about trainees' interactions with the environment, such as their usage of command-line tools. These data contain patterns indicative of trainees' learning processes, and revealing them allows to assess the trainees and provide feedback to help them learn. However, automated analysis of these data is challenging. The training tasks feature complex problem-solving, and many different solution approaches are possible. Moreover, the trainees generate vast amounts of interaction data. This paper explores a dataset from 18 cybersecurity training sessions using data mining and machine learning techniques. We employed pattern mining and clustering to analyze 8834 commands collected from 113 trainees, revealing their typical behavior, mistakes, solution strategies, and difficult training stages. Pattern mining proved suitable in capturing timing information and tool usage frequency. Clustering underlined that many trainees often face the same issues, which can be addressed by targeted scaffolding. Our results show that data mining methods are suitable for analyzing cybersecurity training data. Educational researchers and practitioners can apply these methods in their contexts to assess trainees, support them, and improve the training design. Artifacts associated with this research are publicly available.

data mining, machine learning, pattern recognition, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/s10639-022-10954-4

2307.1026

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Austria > Vienna (0.14)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Instructional Material (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (1.00)
Education > Educational Technology > Educational Software (0.87)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)

Add feedback

A Comprehensive Review of Automated Data Annotation Techniques in Human Activity Recognition

Demrozi, Florenc, Turetta, Cristian, Machot, Fadi Al, Pravadelli, Graziano, Kindt, Philipp H.

arXiv.org Artificial IntelligenceJul-12-2023

Human Activity Recognition (HAR) has become one of the leading research topics of the last decade. As sensing technologies have matured and their economic costs have declined, a host of novel applications, e.g., in healthcare, industry, sports, and daily life activities have become popular. The design of HAR systems requires different time-consuming processing steps, such as data collection, annotation, and model training and optimization. In particular, data annotation represents the most labor-intensive and cumbersome step in HAR, since it requires extensive and detailed manual work from human annotators. Therefore, different methodologies concerning the automation of the annotation procedure in HAR have been proposed. The annotation problem occurs in different notions and scenarios, which all require individual solutions. In this paper, we provide the first systematic review on data annotation techniques for HAR. By grouping existing approaches into classes and providing a taxonomy, our goal is to support the decision on which techniques can be beneficially used in a given scenario.

data mining, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2307.05988

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Italy (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment (1.00)
Information Technology (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Consumer Health (0.92)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Internet of Things (1.00)
Information Technology > Information Management (1.00)
(10 more...)

Add feedback

FIS-ONE: Floor Identification System with One Label for Crowdsourced RF Signals

Zhuo, Weipeng, Chiu, Ka Ho, Chen, Jierun, Zhao, Ziqi, Chan, S. -H. Gary, Ha, Sangtae, Lee, Chul-Ho

arXiv.org Artificial IntelligenceJul-12-2023

Floor labels of crowdsourced RF signals are crucial for many smart-city applications, such as multi-floor indoor localization, geofencing, and robot surveillance. To build a prediction model to identify the floor number of a new RF signal upon its measurement, conventional approaches using the crowdsourced RF signals assume that at least few labeled signal samples are available on each floor. In this work, we push the envelope further and demonstrate that it is technically feasible to enable such floor identification with only one floor-labeled signal sample on the bottom floor while having the rest of signal samples unlabeled. We propose FIS-ONE, a novel floor identification system with only one labeled sample. FIS-ONE consists of two steps, namely signal clustering and cluster indexing. We first build a bipartite graph to model the RF signal samples and obtain a latent representation of each node (each signal sample) using our attention-based graph neural network model so that the RF signal samples can be clustered more accurately. Then, we tackle the problem of indexing the clusters with proper floor labels, by leveraging the observation that signals from an access point can be detected on different floors, i.e., signal spillover. Specifically, we formulate a cluster indexing problem as a combinatorial optimization problem and show that it is equivalent to solving a traveling salesman problem, whose (near-)optimal solution can be found efficiently. We have implemented FIS-ONE and validated its effectiveness on the Microsoft dataset and in three large shopping malls. Our results show that FIS-ONE outperforms other baseline algorithms significantly, with up to 23% improvement in adjusted rand index and 25% improvement in normalized mutual information using only one floor-labeled signal sample.

artificial intelligence, machine learning, social media, (19 more...)

arXiv.org Artificial Intelligence

2307.05914

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Texas (0.04)
North America > United States > Florida > Hillsborough County > Tampa (0.04)
(2 more...)

Genre: Research Report > New Finding (0.54)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Communications > Social Media > Crowdsourcing (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Speech Diarization and ASR with GMM

Sharma, Aayush Kumar, Bhavikatti, Vineet, Nidawani, Amogh, Siddappaji, Dr., P, Sanath, Mishra, Dr Geetishree

arXiv.org Artificial IntelligenceJul-11-2023

In this research paper, we delve into the topics of Speech Diarization and Automatic Speech Recognition (ASR). Speech diarization involves the separation of individual speakers within an audio stream. By employing the ASR transcript, the diarization process aims to segregate each speaker's utterances, grouping them based on their unique audio characteristics. On the other hand, Automatic Speech Recognition refers to the capability of a machine or program to identify and convert spoken words and phrases into a machine-readable format. In our speech diarization approach, we utilize the Gaussian Mixer Model (GMM) to represent speech segments. The inter-cluster distance is computed based on the GMM parameters, and the distance threshold serves as the stopping criterion. ASR entails the conversion of an unknown speech waveform into a corresponding written transcription. The speech signal is analyzed using synchronized algorithms, taking into account the pitch frequency. Our primary objective typically revolves around developing a model that minimizes the Word Error Rate (WER) metric during speech transcription.

artificial intelligence, machine learning, speech diarization, (14 more...)

arXiv.org Artificial Intelligence

2307.05637

Country:

Asia > India > Karnataka > Bengaluru (0.06)
North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)

Genre: Research Report (0.40)

Industry:

Media (0.35)
Leisure & Entertainment (0.35)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Add feedback

Contrastive Representation Disentanglement for Clustering

Ding, Fei, Zhang, Dan, Yang, Yin, Krovi, Venkat, Luo, Feng

arXiv.org Artificial IntelligenceJul-11-2023

Clustering continues to be a significant and challenging task. Recent studies have demonstrated impressive results by applying clustering to feature representations acquired through self-supervised learning, particularly on small datasets. However, when dealing with datasets containing a large number of clusters, such as ImageNet, current methods struggle to achieve satisfactory clustering performance. In this paper, we introduce a novel method called Contrastive representation Disentanglement for Clustering (CDC) that leverages contrastive learning to directly disentangle the feature representation for clustering. In CDC, we decompose the representation into two distinct components: one component encodes categorical information under an equipartition constraint, and the other component captures instance-specific factors. To train our model, we propose a contrastive loss that effectively utilizes both components of the representation. We conduct a theoretical analysis of the proposed loss and highlight how it assigns different weights to negative samples during the process of disentangling the feature representation. Further analysis of the gradients reveals that larger weights emphasize a stronger focus on hard negative samples. As a result, the proposed loss exhibits strong expressiveness, enabling efficient disentanglement of categorical information. Through experimental evaluation on various benchmark datasets, our method demonstrates either state-of-the-art or highly competitive clustering performance. Notably, on the complete ImageNet dataset, we achieve an accuracy of 53.4%, surpassing existing methods by a substantial margin of +10.2%.

artificial intelligence, machine learning, representation, (18 more...)

arXiv.org Artificial Intelligence

2306.05439

Country:

North America > United States > Texas (0.04)
North America > United States > Pennsylvania (0.04)
North America > United States > New York (0.04)
Asia > China (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Promising Solution (0.86)

Industry: Health & Medicine (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Identifying Subgroups of ICU Patients Using End-to-End Multivariate Time-Series Clustering Algorithm Based on Real-World Vital Signs Data

Shi, Tongyue, Zhang, Zhilong, Liu, Wentie, Fang, Junhua, Hao, Jianguo, Jin, Shuai, Zhao, Huiying, Kong, Guilan

arXiv.org Artificial IntelligenceJul-11-2023

This study employed the MIMIC-IV database as data source to investigate the use of dynamic, high-frequency, multivariate time-series vital signs data, including temperature, heart rate, mean blood pressure, respiratory rate, and SpO2, monitored first 8 hours data in the ICU stay. Various clustering algorithms were compared, and an end-to-end multivariate time series clustering system called Time2Feat, combined with K-Means, was chosen as the most effective method to cluster patients in the ICU. In clustering analysis, data of 8,080 patients admitted between 2008 and 2016 was used for model development and 2,038 patients admitted between 2017 and 2019 for model validation. By analyzing the differences in clinical mortality prognosis among different categories, varying risks of ICU mortality and hospital mortality were found between different subgroups. Furthermore, the study visualized the trajectory of vital signs changes. The findings of this study provide valuable insights into the potential use of multivariate time-series clustering systems in patient management and monitoring in the ICU setting.

artificial intelligence, machine learning, subgroup, (10 more...)

arXiv.org Artificial Intelligence

2306.02121

Country:

North America > United States (0.15)
Asia > China > Beijing > Beijing (0.06)
Asia > China > Zhejiang Province > Hangzhou (0.05)

Genre:

Research Report > New Finding (0.57)
Research Report > Experimental Study (0.35)

Industry:

Health & Medicine > Diagnostic Medicine > Vital Signs (1.00)
Health & Medicine > Health Care Providers & Services (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.91)

Add feedback