AITopics

2605.24673

Country: North America > United States > California (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Nicoletti, Flavio, Ma, Chenxiao, Ventura, Enrico, Saglietti, Luca, Mannelli, Stefano Sarao

The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

arXiv.org Machine Learning, May-8-2026

Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models, and potentially exacerbate disparities, remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order, consistently favoring higher-variance classes, while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.

artificial intelligence, data structure, machine learning, (14 more...)

2605.06367

Country: Europe > Italy (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

arXiv.org Machine Learning, May-6-2026

Adaptive graph-based algorithms for conditional anomaly detection and semi-supervised learning

Valko, Michal

We develop graph-based methods for semi-supervised learning based on label propagation on a data similarity graph. When data is abundant or arrives in a stream, problems of computation and data storage arise for any graph-based method. We propose a fast approximate online algorithm that solves for the harmonic solution on an approximate graph. We show, both empirically and theoretically, that good behavior can be achieved by collapsing nearby points into a set of local representative points that minimize distortion. Moreover, we regularize the harmonic solution to achieve better stability properties. We also present graph-based methods for detecting conditional anomalies and apply them to the identification of unusual clinical actions in hospitals. Our hypothesis is that patient-management actions that are unusual with respect to past patients may be due to errors and that it is worthwhile to raise an alert if such a condition is encountered. Conditional anomaly detection extends the standard unconditional anomaly framework but also faces new problems, known as fringe and isolated points. We devise novel nonparametric graph-based methods to tackle these problems. Our methods rely on graph connectivity analysis and the soft harmonic solution. Finally, we conduct an extensive human evaluation study of our conditional anomaly methods with 15 experts in critical care.
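As a rough illustration of the harmonic solution on a similarity graph, the Python sketch below propagates labels on a small, fully connected graph by repeatedly replacing each unlabeled value with the weighted average of its neighbors. The Gaussian similarity, the `sigma` parameter, the function names, and the toy data are all assumptions made for illustration. It computes the plain harmonic solution only, not the approximate online or regularized variants the abstract describes.

```python
import math


def gaussian_weight(x, y, sigma=1.0):
    """Edge weight between two points (assumed Gaussian similarity)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def harmonic_solution(points, labels, sigma=1.0, iters=500):
    """Soft labels for all points, given a dict {index: +1.0 or -1.0}.

    Labeled points keep their values; each unlabeled value is iterated
    toward the weighted average of the values of all other points.
    """
    n = len(points)
    # Dense similarity matrix with no self-loops.
    weights = [
        [0.0 if i == j else gaussian_weight(points[i], points[j], sigma)
         for j in range(n)]
        for i in range(n)
    ]
    f = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        new_f = list(f)
        for i in range(n):
            if i in labels:
                continue  # labeled points are clamped
            degree = sum(weights[i])
            if degree > 0:
                new_f[i] = sum(weights[i][j] * f[j] for j in range(n)) / degree
        f = new_f
    return f


# Hypothetical toy data: two groups on a line, one labeled point in each.
points = [(0.0,), (0.2,), (0.4,), (2.0,), (2.2,), (2.4,)]
labels = {0: 1.0, 5: -1.0}
soft = harmonic_solution(points, labels)
predicted = [1 if value > 0 else -1 for value in soft]
```

Thresholding the soft labels at zero gives a class prediction for each unlabeled point. A dense matrix is used here only for brevity; it is exactly the storage cost that becomes a problem when data is abundant or streaming.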

artificial intelligence, data mining, machine learning, (21 more...)

2605.03495

Country:

Asia (0.92)
Europe (0.92)
North America > United States > Minnesota (0.27)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Instructional Material (0.92)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Education (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.46)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.45)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
(5 more...)

Neural Information Processing Systems, Apr-25-2026, 09:56:48 GMT

ParK: Sound and Efficient Kernel Ridge Regression by Feature Space Partitions

We introduce ParK, a new large-scale solver for kernel ridge regression. Our approach combines partitioning with random projections and iterative optimization to reduce space and time complexity while provably maintaining the same statistical accuracy. In particular, by constructing suitable partitions directly in the feature space rather than in the input space, we promote orthogonality between the local estimators, thus ensuring that key quantities such as local effective dimension and bias remain under control. We characterize the statistical-computational tradeoff of our model, and demonstrate the effectiveness of our method by numerical experiments on large-scale datasets.

artificial intelligence, estimator, machine learning, (15 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.61)

Neural Information Processing Systems, Apr-25-2026, 05:46:49 GMT

299dc35e747eb77177d9cea10a802da2-Paper.pdf

artificial intelligence, machine learning, vector, (19 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Information Management > Search (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Neural Information Processing Systems, Apr-25-2026, 04:39:01 GMT

15212bd2265c4a3ab0dbc1b1982c1b69-Supplemental-Conference.pdf

artificial intelligence, machine learning, representation, (18 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Neural Information Processing Systems, Apr-25-2026, 03:27:46 GMT

25886d7a7cf4e33fd44072a0cd81bf30-Supplemental-Conference.pdf

artificial intelligence, machine learning, natural language, (16 more...)

Country: Asia > China (0.28)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.94)
(2 more...)

Neural Information Processing Systems, Apr-24-2026, 13:54:50 GMT

SOAR: Improved Indexing for Approximate Nearest Neighbor Search

This paper introduces SOAR: Spilling with Orthogonality-Amplified Residuals, a novel data indexing technique for approximate nearest neighbor (ANN) search. SOAR extends previous approaches to ANN search, such as spill trees, that utilize multiple redundant representations while partitioning the data to reduce the probability of missing a nearest neighbor during search. Rather than training and computing these redundant representations independently, however, SOAR uses an orthogonality-amplified residual loss, which optimizes each representation to compensate for cases where other representations perform poorly. This drastically improves the overall index quality, resulting in state-of-the-art ANN benchmark performance while maintaining fast indexing times and low memory consumption.

information retrieval, machine learning, natural language, (20 more...)

Country: North America > United States (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.51)

Chaoyue Liu, Mikhail Belkin

Clustering with Bregman Divergences: an Asymptotic Analysis

Neural Information Processing Systems, Apr-22-2026, 01:55:03 GMT

Clustering, in particular k-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of k-means clustering which has already been widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of clusters k is large. We establish quantization rates and describe the limiting distribution of the centers as k → ∞, extending well-known results for k-means clustering.
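To make the link between k-means and Bregman clustering concrete, the sketch below runs the same alternating procedure with two interchangeable divergences: squared Euclidean distance, which gives ordinary k-means, and the generalized KL divergence. It relies on the standard fact that the arithmetic mean of a cluster minimizes the average Bregman divergence from its points to a center, so only the assignment step changes. The function names, the choice of divergences, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math


def squared_euclidean(x, c):
    """Bregman divergence generated by the squared norm (ordinary k-means)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def generalized_kl(x, c):
    """Generalized KL divergence; all coordinates must be strictly positive."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, c))


def bregman_cluster(points, centers, divergence, max_iters=100):
    """Alternate assignment and mean update until assignments stop changing."""
    centers = [tuple(c) for c in centers]
    assign = None
    for _ in range(max_iters):
        # Assignment step: nearest center under the chosen divergence.
        new_assign = [
            min(range(len(centers)), key=lambda k: divergence(x, centers[k]))
            for x in points
        ]
        if new_assign == assign:
            break
        assign = new_assign
        # Update step: the mean is the optimal center for any Bregman divergence.
        for k in range(len(centers)):
            members = [x for x, a in zip(points, assign) if a == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign


# Hypothetical toy data with strictly positive coordinates.
points = [(1.0, 9.0), (2.0, 8.0), (9.0, 1.0), (8.0, 2.0)]
init = [(1.0, 9.0), (9.0, 1.0)]
kl_centers, kl_assign = bregman_cluster(points, init, generalized_kl)
eu_centers, eu_assign = bregman_cluster(points, init, squared_euclidean)
```

On this well-separated example both divergences recover the same two groups; on less symmetric data the assignment regions generally differ.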

artificial intelligence, bregman divergence, machine learning, (16 more...)

Country:

North America > United States (0.46)
Europe (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Malik, Vikrant, Kargin, Taylan, Hassibi, Babak

Distributionally Robust K-Means Clustering

arXiv.org Machine Learning, Apr-14-2026

In recent years, the widespread availability of large-scale, high-dimensional datasets has driven significant interest in clustering algorithms that are both computationally efficient and robust to distributional shifts and outliers. The classical clustering method, K-means, can be seen as an application of the Lloyd-Max quantization algorithm, in which the distribution being quantized is the empirical distribution of the points to be clustered. This empirical distribution generally differs from the true underlying distribution, especially when the number of points to be clustered is small. This induces a distributional shift, which can also arise in many real-world settings, such as image segmentation, biological data analysis, and sensor networks, due to noise variations, sensor inaccuracies, or environmental changes. Distributional shifts can severely impact the performance of clustering algorithms, leading to degraded cluster assignments and unreliable downstream analysis. The field of clustering has a rich history. One of the most popular algorithms in this field is the K-means (KM) algorithm, introduced by [1], which computes centroids by iteratively updating the conditional mean of the data in the Voronoi regions induced by the centroids. However, standard K-means is sensitive to initialization and, in general, converges only to a local minimum.
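The sensitivity to initialization noted above can be seen on a four-point example. The sketch below is a minimal Lloyd-style K-means, assuming squared Euclidean distance and hand-picked starting centroids; the function names and data are hypothetical, and it illustrates only the standard algorithm, not the distributionally robust variant the paper develops.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def lloyd_kmeans(points, centers, max_iters=100):
    """Standard K-means: returns (centers, assignments, total squared error)."""
    centers = [tuple(c) for c in centers]
    assign = None
    for _ in range(max_iters):
        # Each point joins the Voronoi region of its nearest centroid.
        new_assign = [
            min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
            for x in points
        ]
        if new_assign == assign:
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        # Each centroid moves to the mean of the points in its region.
        for k in range(len(centers)):
            members = [x for x, a in zip(points, assign) if a == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sq_dist(x, centers[a]) for x, a in zip(points, assign))
    return centers, assign, cost


# Corners of a 4 x 1 rectangle: the natural clusters are the two short sides.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Starting on the long sides traps the algorithm in a poor fixed point.
_, bad_assign, bad_cost = lloyd_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

# Starting near the short sides reaches the better solution.
_, good_assign, good_cost = lloyd_kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
```

Both runs stop at a fixed point of the update, but the first pairs points along the long sides at a total squared error of 16, while the second groups the short sides at a total of 1.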

algorithm, artificial intelligence, machine learning, (18 more...)