AITopics

2506.15879

Country: Asia > Middle East > Jordan (0.29)

Genre: Research Report > New Finding (1.00)

Industry: Banking & Finance > Trading (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

arXiv.org Artificial IntelligenceJun-23-2025

AI-based modular warning machine for risk identification in proximity healthcare

Razzetta, Chiara, Noei, Shahryar, Barbarossa, Federico, Spairani, Edoardo, Roascio, Monica, Barbi, Elisa, Ciacci, Giulia, Sommariva, Sara, Guastavino, Sabrina, Piana, Michele, Lenge, Matteo, Arnulfo, Gabriele, Magenes, Giovanni, Maranesi, Elvira, Amabili, Giulio, Massone, Anna Maria, Benvenuto, Federico, Jurman, Giuseppe, Sona, Diego, Campi, Cristina

"DHEAL-COM - Digital Health Solutions in Community Medicine" is a research and technology project funded by the Italian Department of Health for the development of digital solutions of interest in proximity healthcare. The activity within the DHEAL-COM framework allows scientists to gather a notable amount of multi-modal data whose interpretation can be performed by means of machine learning algorithms. The present study illustrates a general automated pipeline made of numerous unsupervised and supervised methods that can ingest such data, provide predictive results, and facilitate model interpretations via feature identification.

artificial intelligence, data mining, machine learning, (13 more...)

2506.15823

Country: Europe > Italy (0.90)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.69)
Health & Medicine > Diagnostic Medicine (0.68)
Health & Medicine > Health Care Technology (0.67)
Government > Regional Government > Europe Government > Italy Government (0.35)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Silva-Sánchez, David, Lederman, Roy R.

An Observation on Lloyd's k-Means Algorithm in High Dimensions

arXiv.org Machine LearningJun-19-2025

Clustering and estimating cluster means are core problems in statistics and machine learning, with k-means and Expectation Maximization (EM) being two widely used algorithms. In this work, we provide a theoretical explanation for the failure of k-means in high-dimensional settings with high noise and limited sample sizes, using a simple Gaussian Mixture Model (GMM). We identify regimes where, with high probability, almost every partition of the data becomes a fixed point of the k-means algorithm. This study is motivated by challenges in the analysis of more complex cases, such as masked GMMs, and those arising from applications in Cryo-Electron Microscopy.

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Machine Learning

2506.14952

Country: North America > United States > Connecticut > New Haven County > New Haven (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Lee, Junghyun, Jedra, Yassir, Proutière, Alexandre, Yun, Se-Young

Near-Optimal Clustering in Mixture of Markov Chains

arXiv.org Machine LearningJun-19-2025

We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$. The goal is to accurately group trajectories according to their underlying generative model. We begin by deriving an instance-dependent, high-probability lower bound on the clustering error rate, governed by the weighted KL divergence between the transition kernels of the chains. We then present a novel two-stage clustering algorithm. In Stage~I, we apply spectral clustering using a new injective Euclidean embedding for ergodic Markov chains -- a contribution of independent interest that enables sharp concentration results. Stage~II refines the initial clusters via a single step of likelihood-based reassignment. Our method achieves a near-optimal clustering error with high probability, under the conditions $H = \tildeΩ(γ_{\mathrm{ps}}^{-1} (S^2 \vee π_{\min}^{-1}))$ and $TH = \tildeΩ(γ_{\mathrm{ps}}^{-1} S^2 )$, where $π_{\min}$ is the minimum stationary probability of a state across the $K$ chains and $γ_{\mathrm{ps}}$ is the minimum pseudo-spectral gap. These requirements provide significant improvements, if not at least comparable, to the state-of-the-art guarantee (Kausik et al., 2023), and moreover, our algorithm offers a key practical advantage: unlike existing approach, it requires no prior knowledge of model-specific quantities (e.g., separation between kernels or visitation probabilities). We conclude by discussing the inherent gap between our upper and lower bounds, providing insights into the unique structure of this clustering problem.

artificial intelligence, machine learning, markov chain, (14 more...)

arXiv.org Machine Learning

2506.01324

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
(6 more...)

Genre: Research Report (0.64)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Contributions to Representation Learning with Graph Autoencoders and Applications to Music Recommendation

Salha-Galvan, Guillaume

Graph autoencoders (GAE) and variational graph autoencoders (VGAE) emerged as two powerful groups of unsupervised node embedding methods, with various applications to graph-based machine learning problems such as link prediction and community detection. Nonetheless, at the beginning of this Ph.D. project, GAE and VGAE models were also suffering from key limitations, preventing them from being adopted in the industry. In this thesis, we present several contributions to improve these models, with the general aim of facilitating their use to address industrial-level problems involving graph representations. Firstly, we propose two strategies to overcome the scalability issues of previous GAE and VGAE models, permitting to effectively train these models on large graphs with millions of nodes and edges. These strategies leverage graph degeneracy and stochastic subgraph decoding techniques, respectively. Besides, we introduce Gravity-Inspired GAE and VGAE, providing the first extensions of these models for directed graphs, that are ubiquitous in industrial applications. We also consider extensions of GAE and VGAE models for dynamic graphs. Furthermore, we argue that GAE and VGAE models are often unnecessarily complex, and we propose to simplify them by leveraging linear encoders. Lastly, we introduce Modularity-Aware GAE and VGAE to improve community detection on graphs, while jointly preserving good performances on link prediction. In the last part of this thesis, we evaluate our methods on several graphs extracted from the music streaming service Deezer. We put the emphasis on graph-based music recommendation problems. In particular, we show that our methods can improve the detection of communities of similar musical items to recommend to users, that they can effectively rank similar artists in a cold start setting, and that they permit modeling the music genre perception across cultures.

artificial intelligence, data mining, machine learning, (20 more...)

2205.14651

Country:

North America > United States (1.00)
Asia (0.67)
Europe > United Kingdom > England (0.27)

Genre:

Summary/Review (1.00)
Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Kühn, Damin, Schaub, Michael T.

Global Ground Metric Learning with Applications to scRNA data

Optimal transport provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric. Traditionally, the ground metric has either been (i) predefined, e.g., as the Euclidean distance, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying importance of different features in the data, and existing supervised approaches to ground metric learning often do not generalize across multiple classes or are restricted to distributions with shared supports. To address these limitations, we propose a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points like a global metric, but requires only class labels on a distribution-level for training. The learned global ground metric enables more accurate optimal transport distances, leading to improved performance in embedding, clustering and classification tasks. We demonstrate the effectiveness and interpretability of our approach using patient-level scRNA-seq data spanning multiple diseases.

artificial intelligence, machine learning, metric learning, (14 more...)

2506.15383

Country:

North America > United States (0.28)
Europe > Italy (0.28)
Europe > Germany (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.94)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Bakagianni, Juli, Pavlopoulos, John, Likas, Aristidis

TopClustRAG at SIGIR 2025 LiveRAG Challenge

We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.

large language model, machine learning, natural language, (17 more...)

2506.15246

Country: Europe > Greece (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

Zhang, Yiqun, Li, Hao, Wang, Chenxu, Chen, Linyao, Zhang, Qiaosheng, Ye, Peng, Feng, Shi, Wang, Daling, Wang, Zhen, Wang, Xinrun, Xu, Jia, Bai, Lei, Ouyang, Wanli, Hu, Shuyue

Proprietary giants are increasingly dominating the race for ever-larger language models. Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers -- a simple recipe that leverages the collective intelligence of these smaller models. The Avengers builds upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: scores each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response with repeated sampling. Remarkably, with 10 open-source models (~7B parameters each), the Avengers surpasses GPT-4o, 4.1, and 4.5 in average performance across 15 diverse datasets spanning mathematics, coding, logical reasoning, general knowledge, and affective tasks. In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter -- the number of clusters.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

2505.19797

Country: Asia (0.92)

Genre: Research Report > New Finding (0.46)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Chakraborty, Diptarka, Chatterjee, Kushagra, Das, Debarati, Nguyen, Tien Long, Nobahari, Romina

Towards Fair Representation: Clustering and Consensus

arXiv.org Artificial IntelligenceJun-18-2025

Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best represents the collective structure of the data. In this work, we study this fundamental problem through the lens of fair clustering, as introduced by Chierichetti et al. [NeurIPS'17], which incorporates the disparate impact doctrine to ensure proportional representation of each protected group in the dataset within every cluster. Our objective is to find a consensus clustering that is not only representative but also fair with respect to specific protected attributes. To the best of our knowledge, we are the first to address this problem and provide a constant-factor approximation. As part of our investigation, we examine how to minimally modify an existing clustering to enforce fairness -- an essential postprocessing step in many clustering applications that require fair representation. We develop an optimal algorithm for datasets with equal group representation and near-linear time constant factor approximation algorithms for more general scenarios with different proportions of two group sizes. We complement our approximation result by showing that the problem is NP-hard for two unequal-sized groups. Given the fundamental nature of this problem, we believe our results on Closest Fair Clustering could have broader implications for other clustering problems, particularly those for which no prior approximation guarantees exist for their fair variants.

artificial intelligence, dist, machine learning, (19 more...)

2506.08673

Country:

Asia (0.45)
North America > United States (0.27)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

arXiv.org Artificial IntelligenceJun-18-2025

CLGNN: A Contrastive Learning-based GNN Model for Betweenness Centrality Prediction on Temporal Graphs

Zhang, Tianming, Zhang, Renbo, Yang, Zhengyi, Gao, Yunjun, Cao, Bin, Fan, Jing

Temporal Betweenness Centrality (TBC) measures how often a node appears on optimal temporal paths, reflecting its importance in temporal networks. However, exact computation is highly expensive, and real-world TBC distributions are extremely imbalanced. The severe imbalance leads learning-based models to overfit to zero-centrality nodes, resulting in inaccurate TBC predictions and failure to identify truly central nodes. Existing graph neural network (GNN) methods either fail to handle such imbalance or ignore temporal dependencies altogether. To address these issues, we propose a scalable and inductive contrastive learning-based GNN (CLGNN) for accurate and efficient TBC prediction. CLGNN builds an instance graph to preserve path validity and temporal order, then encodes structural and temporal features using dual aggregation, i.e., mean and edge-to-node multi-head attention mechanisms, enhanced by temporal path count and time encodings. A stability-based clustering-guided contrastive module (KContrastNet) is introduced to separate high-, median-, and low-centrality nodes in representation space, mitigating class imbalance, while a regression module (ValueNet) estimates TBC values. CLGNN also supports multiple optimal path definitions to accommodate diverse temporal semantics. Extensive experiments demonstrate the effectiveness and efficiency of CLGNN across diverse benchmarks. CLGNN achieves up to a 663.7~$\times$ speedup compared to state-of-the-art exact TBC computation methods. It outperforms leading static GNN baselines with up to 31.4~$\times$ lower MAE and 16.7~$\times$ higher Spearman correlation, and surpasses state-of-the-art temporal GNNs with up to 5.7~$\times$ lower MAE and 3.9~$\times$ higher Spearman correlation.

artificial intelligence, machine learning, node, (15 more...)

2506.14122

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)