Clustering
Clustering and analysis of user behaviour in blockchain: A case study of Planet IX
Zelenyanszki, Dorottya, Hou, Zhe, Biswas, Kamanashis, Muthukkumarasamy, Vallipuram
Decentralised applications (dApps) that run on public blockchains have the benefit of trustworthiness and transparency as every activity that happens on the blockchain can be publicly traced through the transaction data. However, this introduces a potential privacy problem as this data can be tracked and analysed, which can reveal user-behaviour information. A user behaviour analysis pipeline was proposed to present how this type of information can be extracted and analysed to identify separate behavioural clusters that can describe how users behave in the game. The pipeline starts with the collection of transaction data, involving smart contracts, that is collected from a blockchain-based game called Planet IX. Both the raw transaction information and the transaction events are considered in the data collection. From this data, separate game actions can be formed and those are leveraged to present how and when the users conducted their in-game activities in the form of user flows. An extended version of these user flows also presents how the Non-Fungible Tokens (NFTs) are being leveraged in the user actions. The latter is given as input for a Graph Neural Network (GNN) model to provide graph embeddings for these flows which then can be leveraged by clustering algorithms to cluster user behaviours into separate behavioural clusters. We benchmark and compare well-known clustering algorithms as a part of the proposed method. The user behaviour clusters were analysed and visualised in a graph format. It was found that behavioural information can be extracted regarding the users that belong to these clusters. Such information can be exploited by malicious users to their advantage. To demonstrate this, a privacy threat model was also presented based on the results that correspond to multiple potentially affected areas.
Robust Markov stability for community detection at a scale learned based on the structure
Aref, Samin, Mathiyarasan, Sanchaai
Community detection, the unsupervised task of clustering nodes of a graph, finds applications across various fields. The common approaches for community detection involve optimizing an objective function to partition the nodes into communities at a single scale of granularity. However, the single-scale approaches often fall short of producing partitions that are robust and at a suitable scale. The existing algorithm, PyGenStability, returns multiple robust partitions for a network by optimizing the multi-scale Markov stability function. However, in cases where the suitable scale is not known or assumed by the user, there is no principled method to select a single robust partition at a suitable scale from the multiple partitions that PyGenStability produces. Our proposed method combines the Markov stability framework with a pre-trained machine learning model for scale selection to obtain one robust partition at a scale that is learned based on the graph structure. This automatic scale selection involves using a gradient boosting model pre-trained on hand-crafted and embedding-based network features from a labeled dataset of 10k benchmark networks. This model was trained to predicts the scale value that maximizes the similarity of the output partition to the planted partition of the benchmark network. Combining our scale selection algorithm with the PyGenStability algorithm results in PyGenStabilityOne (PO): a hyperparameter-free multi-scale community detection algorithm that returns one robust partition at a suitable scale without the need for any assumptions, input, or tweaking from the user. We compare the performance of PO against 29 algorithms and show that it outperforms 25 other algorithms by statistically meaningful margins. Our results facilitate choosing between community detection algorithms, among which PO stands out as the accurate, robust, and hyperparameter-free method.
InfoClus: Informative Clustering of High-dimensional Data Embeddings
Lai, Fuyin, Heiter, Edith, Bied, Guillaume, Lijffijt, Jefrey
Developing an understanding of high-dimensional data can be facilitated by visualizing that data using dimensionality reduction. However, the low-dimensional embeddings are often difficult to interpret. To facilitate the exploration and interpretation of low-dimensional embeddings, we introduce a new concept named partitioning with explanations. The idea is to partition the data shown through the embedding into groups, each of which is given a sparse explanation using the original high-dimensional attributes. We introduce an objective function that quantifies how much we can learn through observing the explanations of the data partitioning, using information theory, and also how complex the explanations are. Through parameterization of the complexity, we can tune the solutions towards the desired granularity. We propose InfoClus, which optimizes the partitioning and explanations jointly, through greedy search constrained over a hierarchical clustering. We conduct a qualitative and quantitative analysis of InfoClus on three data sets. We contrast the results on the Cytometry data with published manual analysis results, and compare with two other recent methods for explaining embeddings (RVX and VERA). These comparisons highlight that InfoClus has distinct advantages over existing procedures and methods. We find that InfoClus can automatically create good starting points for the analysis of dimensionality-reduction-based scatter plots.
Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations
Yang, Tianjian, Li, Wei Vivian
Generalized Probabilistic Canonical Correlation Analysis for Multi-modal Data Integration with Full or Partial Observations Tianjian Y ang 1 and Wei Vivian Li 1,* 1 Department of Statistics, University of California, Riverside * T o whom correspondence should be addressed: weil@ucr.edu Abstract Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. T o make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/ GPCCA . 1 Introduction Many real-world datasets can be described from multiple perspectives, where each perspective, typically represented as a matrix, corresponds to a data modality . A dataset that is consisted of multiple modalities collected from the same set of individuals is termed a multi-modal dataset [1]. Examples include medical imaging data combining computed tomography (CT) and magnetic 1 arXiv:2504.11610v1 T echnological advances have made the collection of multi-modal data increasingly prevalent, enabling integrative analyses that leverage information across modalities.
Lightweight Trustworthy Distributed Clustering
Li, Hongyang, Wu, Caesar, Chadli, Mohammed, Mammar, Said, Bouvry, Pascal
Ensuring data trustworthiness within individual edge node s while facilitating collaborative data processing poses a critical challenge in edge computing system s (ECS), particularly in resource-constrained scenarios such as autonomous systems sensor networks, indu strial IoT, and smart cities. This paper presents a lightweight, fully distributed k -means clustering algorithm specifically adapted for edge e nvi-ronments, leveraging a distributed averaging approach wit h additive secret sharing, a secure multiparty computation technique, during the cluster center update ph ase to ensure the accuracy and trustworthiness of data across nodes. Edge computing, a paradigm emerging from distributed compu ting, emphasizes processing data at or near its source to minimize latency and reduce band width consumption [1]-[3]. The rapid advancements in edge computing technologies, includ ing algorithms for decentralized and efficient data processing, have significantly accelerated t he deployment of distributed sensor networks. Two key properties of ECS are crucial in large-scale deploym ents.
IsoSEL: Isometric Structural Entropy Learning for Deep Graph Clustering in Hyperbolic Space
Sun, Li, Huang, Zhenhao, Wang, Yujie, Lv, Hongbo, Liu, Chunyang, Peng, Hao, Yu, Philip S.
Graph clustering is a longstanding topic in machine learning. In recent years, deep learning methods have achieved encouraging results, but they still require predefined cluster numbers K, and typically struggle with imbalanced graphs, especially in identifying minority clusters. The limitations motivate us to study a challenging yet practical problem: deep graph clustering without K considering the imbalance in reality. We approach this problem from a fresh perspective of information theory (i.e., structural information). In the literature, structural information has rarely been touched in deep clustering, and the classic definition falls short in its discrete formulation, neglecting node attributes and exhibiting prohibitive complexity. In this paper, we first establish a new Differentiable Structural Information, generalizing the discrete formalism to continuous realm, so that the optimal partitioning tree, revealing the cluster structure, can be created by the gradient backpropagation. Theoretically, we demonstrate its capability in clustering without requiring K and identifying the minority clusters in imbalanced graphs, while reducing the time complexity to O(N) w.r.t. the number of nodes. Subsequently, we present a novel IsoSEL framework for deep graph clustering, where we design a hyperbolic neural network to learn the partitioning tree in the Lorentz model of hyperbolic space, and further conduct Lorentz Tree Contrastive Learning with isometric augmentation. As a result, the partitioning tree incorporates node attributes via mutual information maximization, while the cluster assignment is refined by the proposed tree contrastive learning. Extensive experiments on five benchmark datasets show the IsoSEL outperforms 14 recent baselines by an average of +1.3% in NMI.
SPOT: Spatio-Temporal Pattern Mining and Optimization for Load Consolidation in Freight Transportation Networks
Cheng, Sikai, Hijazi, Amira, Konak, Jeren, Erera, Alan, Van Hentenryck, Pascal
Freight consolidation has significant potential to reduce transportation costs and mitigate congestion and pollution. An effective load consolidation plan relies on carefully chosen consolidation points to ensure alignment with existing transportation management processes, such as driver scheduling, personnel planning, and terminal operations. This complexity represents a significant challenge when searching for optimal consolidation strategies. Traditional optimization-based methods provide exact solutions, but their computational complexity makes them impractical for large-scale instances and they fail to leverage historical data. Machine learning-based approaches address these issues but often ignore operational constraints, leading to infeasible consolidation plans. This work proposes SPOT, an end-to-end approach that integrates the benefits of machine learning (ML) and optimization for load consolidation. The ML component plays a key role in the planning phase by identifying the consolidation points through spatio-temporal clustering and constrained frequent itemset mining, while the optimization selects the most cost-effective feasible consolidation routes for a given operational day. Extensive experiments conducted on industrial load data demonstrate that SPOT significantly reduces travel distance and transportation costs (by about 50% on large terminals) compared to the existing industry-standard load planning strategy and a neighborhood-based heuristic. Moreover, the ML component provides valuable tactical-level insights by identifying frequently recurring consolidation opportunities that guide proactive planning. In addition, SPOT is computationally efficient and can be easily scaled to accommodate large transportation networks.
Adaptive Cluster-Based Synthetic Minority Oversampling Technique for Traffic Mode Choice Prediction with Imbalanced Dataset
Urban datasets such as citizen transportation modes often contain disproportionately distributed classes, posing significant challenges to the classification of under-represented samples using data-driven models. In the literature, various resampling methods have been developed to create synthetic data for minority classes (oversampling) or remove samples from majority classes (undersampling) to alleviate class imbalance. However, oversampling approaches tend to overgeneralize minor classes that are closely clustered and neglect sparse regions which may contain crucial information. Conversely, undersampling methods potentially remove useful information on certain subgroups. Hence, a resampling approach that takes the inherent distribution of data into consideration is required to ensure appropriate synthetic data creation. This study proposes an adaptive cluster-based synthetic minority oversampling technique. Density-based spatial clustering is applied on minority classes to identify subgroups based on their input features. The classes in each of these subgroups are then oversampled according to the ratio of data points of their local cluster to the largest majority class. When used in conjunction with machine learning models such as random forest and extreme gradient boosting, this oversampling method results in significantly higher F1 scores for the minority classes compared to other resampling techniques. These improved models provide accurate classification of transportation modes.
FedSAUC: A Similarity-Aware Update Control for Communication-Efficient Federated Learning in Edge Computing
Lee, Ming-Lun, Chou, Han-Chang, Chen, Yan-Ann
Federated learning is a distributed machine learning framework to collaboratively train a global model without uploading privacy-sensitive data onto a centralized server. Usually, this framework is applied to edge devices such as smartphones, wearable devices, and Internet of Things (IoT) devices which closely collect information from users. However, these devices are mostly battery-powered. The update procedure of federated learning will constantly consume the battery power and the transmission bandwidth. In this work, we propose an update control for federated learning, FedSAUC, by considering the similarity of users' behaviors (models). At the server side, we exploit clustering algorithms to group devices with similar models. Then we select some representatives for each cluster to update information to train the model. We also implemented a testbed prototyping on edge devices for validating the performance. The experimental results show that this update control will not affect the training accuracy in the long run.
Application of machine learning models to predict the relationship between air pollution, ecosystem degradation, and health disparities and lung cancer in Vietnam
Tran, Ngoc Hong, Vien, Lan Kim, Le, Ngoc-Thao Thi
Lung cancer is one of the major causes of death worldwide, and Vietnam is not an exception. This disease is the second most common type of cancer globally and the second most common cause of death in Vietnam, just after liver cancer, with 23,797 fatal cases and 26,262 new cases, or 14.4% of the disease in 2020. Recently, with rising disease rates in Vietnam causing a huge public health burden, lung cancer continues to hold the top position in attention and care. Especially together with climate change, under a variety of types of pollution, deforestation, and modern lifestyles, lung cancer risks are on red alert, particularly in Vietnam. To understand more about the severe disease sources in Vietnam from a diversity of key factors, including environmental features and the current health state, with a particular emphasis on Vietnam's distinct socioeconomic and ecological context, we utilize large datasets such as patient health records and environmental indicators containing necessary information, such as deforestation rate, green cover rate, air pollution, and lung cancer risks, that is collected from well-known governmental sharing websites. Then, we process and connect them and apply analytical methods (heatmap, information gain, p-value, spearman correlation) to determine causal correlations influencing lung cancer risks. Moreover, we deploy machine learning (ML) models (Decision Tree, Random Forest, Support Vector Machine, K-mean clustering) to discover cancer risk patterns. Our experimental results, leveraged by the aforementioned ML models to identify the disease patterns, are promising, particularly, the models as Random Forest, SVM, and PCA are working well on the datasets and give high accuracy (99%), however, the K means clustering has very low accuracy (10%) and does not fit the datasets.