AITopics

2504.11702

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)
Banking & Finance > Trading (0.93)
Information Technology > Services > e-Commerce Services (0.66)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
(2 more...)

Aref, Samin, Mathiyarasan, Sanchaai

Robust Markov stability for community detection at a scale learned based on the structure

arXiv.org Artificial IntelligenceApr-17-2025

Community detection, the unsupervised task of clustering nodes of a graph, finds applications across various fields. The common approaches for community detection involve optimizing an objective function to partition the nodes into communities at a single scale of granularity. However, the single-scale approaches often fall short of producing partitions that are robust and at a suitable scale. The existing algorithm, PyGenStability, returns multiple robust partitions for a network by optimizing the multi-scale Markov stability function. However, in cases where the suitable scale is not known or assumed by the user, there is no principled method to select a single robust partition at a suitable scale from the multiple partitions that PyGenStability produces. Our proposed method combines the Markov stability framework with a pre-trained machine learning model for scale selection to obtain one robust partition at a scale that is learned based on the graph structure. This automatic scale selection involves using a gradient boosting model pre-trained on hand-crafted and embedding-based network features from a labeled dataset of 10k benchmark networks. This model was trained to predicts the scale value that maximizes the similarity of the output partition to the planted partition of the benchmark network. Combining our scale selection algorithm with the PyGenStability algorithm results in PyGenStabilityOne (PO): a hyperparameter-free multi-scale community detection algorithm that returns one robust partition at a suitable scale without the need for any assumptions, input, or tweaking from the user. We compare the performance of PO against 29 algorithms and show that it outperforms 25 other algorithms by statistically meaningful margins. Our results facilitate choosing between community detection algorithms, among which PO stands out as the accurate, robust, and hyperparameter-free method.

artificial intelligence, data mining, machine learning, (20 more...)

2504.11621

Country:

North America > United States (1.00)
Europe (0.94)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.66)

Industry:

Leisure & Entertainment (0.93)
Telecommunications > Networks (0.34)
Information Technology > Networks (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Lai, Fuyin, Heiter, Edith, Bied, Guillaume, Lijffijt, Jefrey

InfoClus: Informative Clustering of High-dimensional Data Embeddings

arXiv.org Artificial IntelligenceApr-16-2025

Developing an understanding of high-dimensional data can be facilitated by visualizing that data using dimensionality reduction. However, the low-dimensional embeddings are often difficult to interpret. To facilitate the exploration and interpretation of low-dimensional embeddings, we introduce a new concept named partitioning with explanations. The idea is to partition the data shown through the embedding into groups, each of which is given a sparse explanation using the original high-dimensional attributes. We introduce an objective function that quantifies how much we can learn through observing the explanations of the data partitioning, using information theory, and also how complex the explanations are. Through parameterization of the complexity, we can tune the solutions towards the desired granularity. We propose InfoClus, which optimizes the partitioning and explanations jointly, through greedy search constrained over a hierarchical clustering. We conduct a qualitative and quantitative analysis of InfoClus on three data sets. We contrast the results on the Cytometry data with published manual analysis results, and compare with two other recent methods for explaining embeddings (RVX and VERA). These comparisons highlight that InfoClus has distinct advantages over existing procedures and methods. We find that InfoClus can automatically create good starting points for the analysis of dimensionality-reduction-based scatter plots.

artificial intelligence, infoclus, machine learning, (16 more...)

2504.11089

Country: Europe (0.28)

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.36)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Yang, Tianjian, Li, Wei Vivian

Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations

arXiv.org Machine LearningApr-15-2025

Generalized Probabilistic Canonical Correlation Analysis for Multi-modal Data Integration with Full or Partial Observations Tianjian Y ang 1 and Wei Vivian Li 1,* 1 Department of Statistics, University of California, Riverside * T o whom correspondence should be addressed: weil@ucr.edu Abstract Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. T o make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/ GPCCA . 1 Introduction Many real-world datasets can be described from multiple perspectives, where each perspective, typically represented as a matrix, corresponds to a data modality . A dataset that is consisted of multiple modalities collected from the same set of individuals is termed a multi-modal dataset [1]. Examples include medical imaging data combining computed tomography (CT) and magnetic 1 arXiv:2504.11610v1 T echnological advances have made the collection of multi-modal data increasingly prevalent, enabling integrative analyses that leverage information across modalities.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Machine Learning

2504.1161

Country:

North America > United States > California > Riverside County > Riverside (0.24)
North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.68)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.89)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Lightweight Trustworthy Distributed Clustering

Li, Hongyang, Wu, Caesar, Chadli, Mohammed, Mammar, Said, Bouvry, Pascal

Ensuring data trustworthiness within individual edge node s while facilitating collaborative data processing poses a critical challenge in edge computing system s (ECS), particularly in resource-constrained scenarios such as autonomous systems sensor networks, indu strial IoT, and smart cities. This paper presents a lightweight, fully distributed k -means clustering algorithm specifically adapted for edge e nvi-ronments, leveraging a distributed averaging approach wit h additive secret sharing, a secure multiparty computation technique, during the cluster center update ph ase to ensure the accuracy and trustworthiness of data across nodes. Edge computing, a paradigm emerging from distributed compu ting, emphasizes processing data at or near its source to minimize latency and reduce band width consumption [1]-[3]. The rapid advancements in edge computing technologies, includ ing algorithms for decentralized and efficient data processing, have significantly accelerated t he deployment of distributed sensor networks. Two key properties of ECS are crucial in large-scale deploym ents.

artificial intelligence, machine learning, node, (18 more...)

2504.10109

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Networks > Sensor Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.57)

IsoSEL: Isometric Structural Entropy Learning for Deep Graph Clustering in Hyperbolic Space

Sun, Li, Huang, Zhenhao, Wang, Yujie, Lv, Hongbo, Liu, Chunyang, Peng, Hao, Yu, Philip S.

Graph clustering is a longstanding topic in machine learning. In recent years, deep learning methods have achieved encouraging results, but they still require predefined cluster numbers K, and typically struggle with imbalanced graphs, especially in identifying minority clusters. The limitations motivate us to study a challenging yet practical problem: deep graph clustering without K considering the imbalance in reality. We approach this problem from a fresh perspective of information theory (i.e., structural information). In the literature, structural information has rarely been touched in deep clustering, and the classic definition falls short in its discrete formulation, neglecting node attributes and exhibiting prohibitive complexity. In this paper, we first establish a new Differentiable Structural Information, generalizing the discrete formalism to continuous realm, so that the optimal partitioning tree, revealing the cluster structure, can be created by the gradient backpropagation. Theoretically, we demonstrate its capability in clustering without requiring K and identifying the minority clusters in imbalanced graphs, while reducing the time complexity to O(N) w.r.t. the number of nodes. Subsequently, we present a novel IsoSEL framework for deep graph clustering, where we design a hyperbolic neural network to learn the partitioning tree in the Lorentz model of hyperbolic space, and further conduct Lorentz Tree Contrastive Learning with isometric augmentation. As a result, the partitioning tree incorporates node attributes via mutual information maximization, while the cluster assignment is refined by the proposed tree contrastive learning. Extensive experiments on five benchmark datasets show the IsoSEL outperforms 14 recent baselines by an average of +1.3% in NMI.

artificial intelligence, isometric structural entropy learning, machine learning, (3 more...)

2504.0997

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

SPOT: Spatio-Temporal Pattern Mining and Optimization for Load Consolidation in Freight Transportation Networks

Cheng, Sikai, Hijazi, Amira, Konak, Jeren, Erera, Alan, Van Hentenryck, Pascal

Freight consolidation has significant potential to reduce transportation costs and mitigate congestion and pollution. An effective load consolidation plan relies on carefully chosen consolidation points to ensure alignment with existing transportation management processes, such as driver scheduling, personnel planning, and terminal operations. This complexity represents a significant challenge when searching for optimal consolidation strategies. Traditional optimization-based methods provide exact solutions, but their computational complexity makes them impractical for large-scale instances and they fail to leverage historical data. Machine learning-based approaches address these issues but often ignore operational constraints, leading to infeasible consolidation plans. This work proposes SPOT, an end-to-end approach that integrates the benefits of machine learning (ML) and optimization for load consolidation. The ML component plays a key role in the planning phase by identifying the consolidation points through spatio-temporal clustering and constrained frequent itemset mining, while the optimization selects the most cost-effective feasible consolidation routes for a given operational day. Extensive experiments conducted on industrial load data demonstrate that SPOT significantly reduces travel distance and transportation costs (by about 50% on large terminals) compared to the existing industry-standard load planning strategy and a neighborhood-based heuristic. Moreover, the ML component provides valuable tactical-level insights by identifying frequently recurring consolidation opportunities that guide proactive planning. In addition, SPOT is computationally efficient and can be easily scaled to accommodate large transportation networks.

consolidation, machine learning, pattern recognition, (21 more...)

2504.0968

Country:

Asia (0.92)
North America > United States > California (0.46)

Genre: Research Report (0.82)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Freight & Logistics Services (1.00)
Consumer Products & Services > Travel (1.00)
Transportation > Ground > Road (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.82)
(2 more...)

Ooi, Guang An, Ahmed, Shehab

Adaptive Cluster-Based Synthetic Minority Oversampling Technique for Traffic Mode Choice Prediction with Imbalanced Dataset

Urban datasets such as citizen transportation modes often contain disproportionately distributed classes, posing significant challenges to the classification of under-represented samples using data-driven models. In the literature, various resampling methods have been developed to create synthetic data for minority classes (oversampling) or remove samples from majority classes (undersampling) to alleviate class imbalance. However, oversampling approaches tend to overgeneralize minor classes that are closely clustered and neglect sparse regions which may contain crucial information. Conversely, undersampling methods potentially remove useful information on certain subgroups. Hence, a resampling approach that takes the inherent distribution of data into consideration is required to ensure appropriate synthetic data creation. This study proposes an adaptive cluster-based synthetic minority oversampling technique. Density-based spatial clustering is applied on minority classes to identify subgroups based on their input features. The classes in each of these subgroups are then oversampled according to the ratio of data points of their local cluster to the largest majority class. When used in conjunction with machine learning models such as random forest and extreme gradient boosting, this oversampling method results in significantly higher F1 scores for the minority classes compared to other resampling techniques. These improved models provide accurate classification of transportation modes.

artificial intelligence, dataset, machine learning, (16 more...)

2504.09486

Country: Asia > Middle East > Saudi Arabia (0.14)

Genre: Research Report (1.00)

Industry: Transportation (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Lee, Ming-Lun, Chou, Han-Chang, Chen, Yan-Ann

FedSAUC: A Similarity-Aware Update Control for Communication-Efficient Federated Learning in Edge Computing

Federated learning is a distributed machine learning framework to collaboratively train a global model without uploading privacy-sensitive data onto a centralized server. Usually, this framework is applied to edge devices such as smartphones, wearable devices, and Internet of Things (IoT) devices which closely collect information from users. However, these devices are mostly battery-powered. The update procedure of federated learning will constantly consume the battery power and the transmission bandwidth. In this work, we propose an update control for federated learning, FedSAUC, by considering the similarity of users' behaviors (models). At the server side, we exploit clustering algorithms to group devices with similar models. Then we select some representatives for each cluster to update information to train the model. We also implemented a testbed prototyping on edge devices for validating the performance. The experimental results show that this update control will not affect the training accuracy in the long run.

artificial intelligence, federated learning, machine learning, (15 more...)

doi: 10.23919/ICMU50196.2021.9638814

2504.04867

Genre: Research Report > New Finding (0.34)

Industry:

Energy > Energy Storage (0.54)
Information Technology > Hardware (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Tran, Ngoc Hong, Vien, Lan Kim, Le, Ngoc-Thao Thi

Application of machine learning models to predict the relationship between air pollution, ecosystem degradation, and health disparities and lung cancer in Vietnam

arXiv.org Artificial IntelligenceApr-14-2025

Lung cancer is one of the major causes of death worldwide, and Vietnam is not an exception. This disease is the second most common type of cancer globally and the second most common cause of death in Vietnam, just after liver cancer, with 23,797 fatal cases and 26,262 new cases, or 14.4% of the disease in 2020. Recently, with rising disease rates in Vietnam causing a huge public health burden, lung cancer continues to hold the top position in attention and care. Especially together with climate change, under a variety of types of pollution, deforestation, and modern lifestyles, lung cancer risks are on red alert, particularly in Vietnam. To understand more about the severe disease sources in Vietnam from a diversity of key factors, including environmental features and the current health state, with a particular emphasis on Vietnam's distinct socioeconomic and ecological context, we utilize large datasets such as patient health records and environmental indicators containing necessary information, such as deforestation rate, green cover rate, air pollution, and lung cancer risks, that is collected from well-known governmental sharing websites. Then, we process and connect them and apply analytical methods (heatmap, information gain, p-value, spearman correlation) to determine causal correlations influencing lung cancer risks. Moreover, we deploy machine learning (ML) models (Decision Tree, Random Forest, Support Vector Machine, K-mean clustering) to discover cancer risk patterns. Our experimental results, leveraged by the aforementioned ML models to identify the disease patterns, are promising, particularly, the models as Random Forest, SVM, and PCA are working well on the datasets and give high accuracy (99%), however, the K means clustering has very low accuracy (10%) and does not fit the datasets.

artificial intelligence, lung cancer, machine learning, (15 more...)

2504.08651

Country: Asia > Vietnam (1.00)

Genre: Research Report > Experimental Study (0.91)

Industry: Health & Medicine > Therapeutic Area > Oncology > Lung Cancer (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)