AITopics

2506.13901

Country: Asia > India (0.92)

Genre: Research Report > New Finding (0.67)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Law Enforcement & Public Safety > Terrorism (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Statistical Machine Learning for Astronomy -- A Textbook

Ting, Yuan-Sen

This textbook provides a systematic treatment of statistical machine learning for astronomical research through the lens of Bayesian inference, developing a unified framework that reveals connections between modern data analysis techniques and traditional statistical methods. We show how these techniques emerge from familiar statistical foundations. The consistently Bayesian perspective prioritizes uncertainty quantification and statistical rigor essential for scientific inference in astronomy. The textbook progresses from probability theory and Bayesian inference through supervised learning including linear regression with measurement uncertainties, logistic regression, and classification. Unsupervised learning topics cover Principal Component Analysis and clustering methods. We then introduce computational techniques through sampling and Markov Chain Monte Carlo, followed by Gaussian Processes as probabilistic nonparametric methods and neural networks within the broader statistical context. Our theory-focused pedagogical approach derives each method from first principles with complete mathematical development, emphasizing statistical insight and complementing with astronomical applications. We prioritize understanding why algorithms work, when they are appropriate, and how they connect to broader statistical principles. The treatment builds toward modern techniques including neural networks through a solid foundation in classical methods and their theoretical underpinnings. This foundation enables thoughtful application of these methods to astronomical research, ensuring proper consideration of assumptions, limitations, and uncertainty propagation essential for advancing astronomical knowledge in the era of large astronomical surveys.

bayesian inference, book review, machine learning, (20 more...)

2506.1223

Genre:

Summary/Review (1.00)
Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(3 more...)

Fair Bayesian Model-Based Clustering

Lee, Jihu, Kim, Kunwoong, Kim, Yongdai

Fair clustering has become a socially significant task with the advancement of machine learning technologies and the growing demand for trustworthy AI. Group fairness ensures that the proportions of each sensitive group are similar in all clusters. Most existing group-fair clustering methods are based on the $K$-means clustering and thus require the distance between instances and the number of clusters to be given in advance. To resolve this limitation, we propose a fair Bayesian model-based clustering called Fair Bayesian Clustering (FBC). We develop a specially designed prior which puts its mass only on fair clusters, and implement an efficient MCMC algorithm. Advantages of FBC are that it can infer the number of clusters and can be applied to any data type as long as the likelihood is defined (e.g., categorical data). Experiments on real-world datasets show that FBC (i) reasonably infers the number of clusters, (ii) achieves a competitive utility-fairness trade-off compared to existing fair clustering methods, and (iii) performs well on categorical data.
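The group-fairness notion in this abstract (each sensitive group's share should be similar in every cluster) can be made concrete with a small check. The sketch below is only an illustration under assumptions: the function name `group_proportion_gap` and the choice of "largest absolute deviation from the overall share" as the summary statistic are hypothetical and are not taken from the FBC paper.

```python
from collections import Counter

def group_proportion_gap(cluster_ids, group_ids):
    """Largest absolute gap between a group's share inside any cluster
    and that group's share in the whole dataset (0.0 means perfectly balanced)."""
    n = len(cluster_ids)
    overall = Counter(group_ids)          # group -> count over all instances
    per_cluster = {}                      # cluster -> Counter of groups
    for c, g in zip(cluster_ids, group_ids):
        per_cluster.setdefault(c, Counter())[g] += 1
    gap = 0.0
    for counts in per_cluster.values():
        size = sum(counts.values())
        for g, total in overall.items():
            # Counter returns 0 for groups that are absent from this cluster.
            gap = max(gap, abs(counts[g] / size - total / n))
    return gap

# Every cluster mirrors the overall 50/50 split, so the gap is zero.
balanced = group_proportion_gap([0, 0, 1, 1], ["a", "b", "a", "b"])
# Each cluster contains a single group, so the gap is as large as it can be here.
segregated = group_proportion_gap([0, 0, 1, 1], ["a", "a", "b", "b"])
```

A smaller value means the clustering is closer to the group-fair ideal described above; how such a quantity is traded off against clustering utility is specific to each method.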

artificial intelligence, machine learning, mixture model, (17 more...)

2506.12839

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

PROTOCOL: Partial Optimal Transport-enhanced Contrastive Learning for Imbalanced Multi-view Clustering

Xue, Xuqian, Lei, Yiming, Cai, Qi, Shan, Hongming, Zhang, Junping

artificial intelligence, machine learning, partial optimal transport-enhanced contrastive learning, (8 more...)

While contrastive multi-view clustering has achieved remarkable success, it implicitly assumes a balanced class distribution. However, real-world multi-view data typically exhibits an imbalanced class distribution. Consequently, existing methods suffer performance degradation due to their inability to perceive and model such imbalance. To address this challenge, we present the first systematic study of imbalanced multi-view clustering, focusing on two fundamental problems: i. perceiving the imbalanced class distribution, and ii. mitigating representation degradation of minority samples. We propose PROTOCOL, a novel PaRtial Optimal TranspOrt-enhanced COntrastive Learning framework for imbalanced multi-view clustering. First, for class imbalance perception, we map multi-view features into a consensus space and reformulate imbalanced clustering as a partial optimal transport (POT) problem, augmented with progressive mass constraints and weighted KL divergence for class distributions. Second, we develop a POT-enhanced class-rebalanced contrastive learning scheme at both feature and class levels, incorporating logit adjustment and class-sensitive learning to enhance minority sample representations. Extensive experiments demonstrate that PROTOCOL significantly improves clustering performance on imbalanced multi-view data, filling a critical research gap in this field.

2506.12408

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > Canada (0.04)
Asia > Singapore (0.04)
Asia > China > Shandong Province > Qingdao (0.04)

Genre:

Research Report (1.00)
Overview (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Even, Bertrand, Giraud, Christophe, Verzelen, Nicolas

Computational lower bounds in latent models: clustering, sparse-clustering, biclustering

In high-dimensional statistics, the primary goal is to derive computationally efficient estimation procedures that achieve the best possible statistical performance. Yet, in many problems, such as sparse PCA, planted clique or clustering, the best known polynomial-time algorithms are unable to match the performance provably achievable by the best estimators (without computational constraints). This observation has led to several conjectures on the existence of gaps (called statistical-computational gaps) between the optimal statistical performance, i.e. the best performance achievable without computational constraints, and the best performance achievable by polynomial-time algorithms. In particular, to assess the quality of a computationally efficient algorithm for a given task, its theoretical performance should be compared not to the optimal statistical performance (without computational constraints), but to the performance of the best poly-time algorithm. This raises the problem of establishing lower bounds on the performance of the best poly-time algorithms for a wide range of problems. Since high-dimensional statistics deals with random instances, the classical notions of worst-case hardness, such as P, NP, etc., are not suitable for the high-dimensional framework. Instead, lower bounds are obtained for some specific models of computation, such as SoS [38, 10], overlap gap property [32], statistical query [41, 13], and low-degree polynomials [37, 44, 66], possibly combined with reductions between different statistical problems [12, 11, 14].

artificial intelligence, machine learning, probability, (18 more...)

2506.13647

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
Europe > France > Occitanie > Hérault > Montpellier (0.04)

Genre:

Research Report (1.00)
Workflow (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

EBS-CFL: Efficient and Byzantine-robust Secure Clustered Federated Learning

Li, Zhiqiang, Bao, Haiyong, Guan, Menghong, Pan, Hao, Huang, Cheng, Dai, Hong-Ning

Despite federated learning (FL)'s potential in collaborative learning, its performance deteriorates under the data heterogeneity of distributed users. Recently, clustered federated learning (CFL) has emerged to address this challenge by partitioning users into clusters according to their similarity. However, CFL faces difficulties in training when users are unwilling to share their cluster identities due to privacy concerns. To address these issues, we present an innovative Efficient and Robust Secure Aggregation scheme for CFL, dubbed EBS-CFL. The proposed EBS-CFL supports effective CFL training while keeping users' cluster identities confidential. Moreover, it detects potential poisoning attacks without compromising individual client gradients by discarding negatively correlated gradients and aggregating positively correlated ones using a weighted approach. The server also authenticates correct gradient encoding by clients. EBS-CFL has high efficiency with client-side overhead O(ml + m^2) for communication and O(m^2l) for computation, where m is the number of cluster identities, and l is the gradient size. When m = 1, the client-side computational efficiency of EBS-CFL is at least O(log n) times better than that of comparison schemes, where n is the number of clients. In addition, we validate the scheme through extensive experiments. Finally, we theoretically prove the scheme's security.
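The filtering idea mentioned in this abstract (drop negatively correlated gradients, combine the positively correlated ones with weights) can be sketched in plaintext as follows. This is a simplified illustration only: the use of cosine similarity against a given `reference` direction, the similarity-proportional weights, and the function names are all assumptions, and the sketch omits the secure aggregation and encoding checks that the actual scheme relies on.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def filtered_aggregate(gradients, reference):
    """Weighted average of client gradients, ignoring those that point away
    from the reference direction (assumed weighting: clipped cosine similarity)."""
    weights = [max(cosine(g, reference), 0.0) for g in gradients]
    total = sum(weights)
    if total == 0.0:
        # No gradient agrees with the reference; return a zero update.
        return [0.0] * len(reference)
    dim = len(reference)
    return [sum(w * g[i] for w, g in zip(weights, gradients)) / total
            for i in range(dim)]

# The second client's gradient opposes the reference and receives zero weight.
update = filtered_aggregate([[1.0, 0.0], [-1.0, 0.0]], reference=[1.0, 0.0])
```

In a real deployment the server would not see individual gradients in the clear, so this computation would have to happen over encoded or masked values.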

artificial intelligence, gradient, machine learning, (17 more...)

doi: 10.1609/aaai.v39i17.34046

2506.13612

Country:

Asia > China (0.28)
Europe > Austria (0.28)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Fan, Chenglin, Shin, Kijun

Learning Augmented Graph $k$-Clustering

Clustering is a cornerstone of unsupervised machine learning, widely applied in fields such as data organization, anomaly detection, and community detection in networks [Xu and Wunsch, 2005]. Among clustering problems, the k-means and k-median problems stand out as fundamental due to their simplicity and effectiveness. Traditional algorithms aim to partition data into k clusters, minimizing either the sum of squared distances (k-means) or the sum of absolute distances (k-median) to their respective cluster centers. The k-means algorithm has been a cornerstone of clustering research for decades, tracing its roots to foundational works by [MacQueen, 1967] and [Lloyd, 1982], who introduced the iterative optimization approach still used today. Extensions by [Hartigan and Wong, 1979] improved convergence, while [Forgy, 1965] proposed widely-used initialization techniques. The optimization principles underlying k-means were influenced by earlier algorithmic developments, such as Floyd's contributions to optimization [Floyd, 1962]. Improvements include k-means++ [Arthur and Vassilvitskii, 2007], which introduced a probabilistic seeding strategy to improve initialization quality and convergence, and Mini-Batch k-means [Sculley, 2010], which enabled clustering on massive datasets with reduced computational overhead.
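The two objectives contrasted in this excerpt differ only in the power applied to each point's distance to its nearest center. The sketch below evaluates both for a fixed set of centers; it assumes Euclidean points for simplicity (the paper itself concerns graph metrics), and the function name `clustering_cost` is hypothetical.

```python
import math

def clustering_cost(points, centers, power):
    """Sum over points of (distance to the nearest center) ** power.
    power=2 gives the k-means cost, power=1 the k-median cost."""
    total = 0.0
    for p in points:
        nearest = min(math.dist(p, c) for c in centers)
        total += nearest ** power
    return total

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

# Only the middle point is off-center, at distance 2 from its nearest center.
k_median_cost = clustering_cost(points, centers, power=1)
k_means_cost = clustering_cost(points, centers, power=2)
```

Because distances are squared, the k-means cost penalizes far-away points more heavily than the k-median cost does, which is one reason the two problems can prefer different centers.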

algorithm, artificial intelligence, machine learning, (13 more...)

2506.13533

Genre: Research Report (0.51)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Semoglou, Aggelos, Likas, Aristidis, Pavlopoulos, John

Silhouette-Guided Instance-Weighted k-means

Clustering is a fundamental unsupervised learning task with numerous applications across diverse fields. Popular algorithms such as k-means often struggle with outliers or imbalances, leading to distorted centroids and suboptimal partitions. We introduce K-Sil, a silhouette-guided refinement of the k-means algorithm that weights points based on their silhouette scores, prioritizing well-clustered instances while suppressing borderline or noisy regions. The algorithm emphasizes user-specified silhouette aggregation metrics (macro-averaged, micro-averaged, or a combination of the two) through self-tuning weighting schemes, supported by appropriate sampling strategies and scalable approximations. These components ensure computational efficiency and adaptability to diverse dataset geometries. Theoretical guarantees establish centroid convergence, and empirical validation on synthetic and real-world datasets demonstrates statistically significant improvements in silhouette scores over k-means and two other instance-weighted k-means variants. These results establish K-Sil as a principled alternative for applications demanding high-quality, well-separated clusters.
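The core idea of weighting points by their silhouette scores before updating centroids can be illustrated with a single weighted update step. This is a rough sketch under assumptions, not the K-Sil algorithm: the weight rule (silhouette clipped at zero) is a placeholder for the paper's self-tuning schemes, the silhouettes are computed exactly rather than approximated, and the function names are hypothetical.

```python
import math

def silhouette_scores(points, labels):
    """Exact per-point silhouette: (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b the smallest mean distance
    to any other cluster. Points in singleton clusters get 0.0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

def weighted_centroids(points, labels, weights):
    """Weighted mean of each cluster; clusters whose total weight is zero are omitted."""
    sums, totals = {}, {}
    for p, lab, w in zip(points, labels, weights):
        acc = sums.setdefault(lab, [0.0] * len(p))
        for k, x in enumerate(p):
            acc[k] += w * x
        totals[lab] = totals.get(lab, 0.0) + w
    return {lab: [s / totals[lab] for s in acc]
            for lab, acc in sums.items() if totals[lab] > 0}

points = [(0.0,), (1.0,), (5.0,), (20.0,), (21.0,)]
labels = [0, 0, 0, 1, 1]
scores = silhouette_scores(points, labels)
# Assumed weight rule: clip negative silhouettes to zero.
weights = [max(s, 0.0) for s in scores]
centroids = weighted_centroids(points, labels, weights)
```

In this toy data the point at 5.0 sits at the edge of cluster 0 and has the lowest silhouette in that cluster, so the weighted centroid moves toward the tighter pair at 0.0 and 1.0 compared with the plain mean of 2.0.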

artificial intelligence, machine learning, silhouette score, (18 more...)

2506.12878

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Zhao, Weiying, Unagaev, Aleksei, Efremova, Natalia

Data-Driven Soil Organic Carbon Sampling: Integrating Spectral Clustering with Conditioned Latin Hypercube Optimization

Soil organic carbon (SOC) monitoring often relies on selecting representative field sampling locations based on environmental covariates. We propose a novel hybrid methodology that integrates spectral clustering, an unsupervised machine learning technique, with conditioned Latin hypercube sampling (cLHS) to enhance the representativeness of SOC sampling. In our approach, spectral clustering partitions the study area into $K$ homogeneous zones using multivariate covariate data, and cLHS is then applied within each zone to select sampling locations that collectively capture the full diversity of environmental conditions. This hybrid spectral-cLHS method ensures that even minor but important environmental clusters are sampled, addressing a key limitation of vanilla cLHS, which can overlook such areas. We demonstrate on a real SOC mapping dataset that spectral-cLHS provides more uniform coverage of covariate feature space and spatial heterogeneity than standard cLHS. This improved sampling design has the potential to yield more accurate SOC predictions by providing better-balanced training data for machine learning models.

artificial intelligence, feature space, machine learning, (13 more...)

2506.10419

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Hoang, Cuong Manh, Lee, Yeejin, Kang, Byeongkeun

Unsupervised Contrastive Learning Using Out-Of-Distribution Data for Long-Tailed Dataset

This work addresses the task of self-supervised learning (SSL) on a long-tailed dataset that aims to learn balanced and well-separated representations for downstream tasks such as image classification. This task is crucial because the real world contains numerous object categories, and their distributions are inherently imbalanced. Towards robust SSL on a class-imbalanced dataset, we investigate leveraging a network trained using unlabeled out-of-distribution (OOD) data that are prevalently available online. We first train a network using both in-domain (ID) and sampled OOD data by back-propagating the proposed pseudo semantic discrimination loss alongside a domain discrimination loss. The OOD data sampling and loss functions are designed to learn a balanced and well-separated embedding space. Subsequently, we further optimize the network on ID data by unsupervised contrastive learning while using the previously trained network as a guiding network. The guiding network is utilized to select positive/negative samples and to control the strengths of attractive/repulsive forces in contrastive learning. We also distil and transfer its embedding space to the training network to maintain balancedness and separability. Through experiments on four publicly available long-tailed datasets, we demonstrate that the proposed method outperforms previous state-of-the-art methods.

Introduction. Self-supervised learning (SSL) is an important research topic because it enables learning representations without human-annotated labels.

artificial intelligence, dataset, machine learning, (16 more...)

2506.12698

Country: Asia > South Korea (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)