AITopics

2504.15854

Country:

North America > United States > New York > Rensselaer County > Troy (0.04)
North America > Greenland (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Data Science (0.68)

arXiv.org Machine LearningApr-22-2025

A Geometric Approach to Problems in Optimization and Data Science

Manoj, Naren Sarayu

We give new results for problems in computational and statistical machine learning using tools from high-dimensional geometry and probability. We break up our treatment into two parts. In Part I, we focus on computational considerations in optimization. Specifically, we give new algorithms for approximating convex polytopes in a stream, sparsification and robust least squares regression, and dueling optimization. In Part II, we give new statistical guarantees for data science problems. In particular, we formulate a new model in which we analyze statistical properties of backdoor data poisoning attacks, and we study the robustness of graph clustering algorithms to ``helpful'' misspecification.

artificial intelligence, data mining, machine learning, (21 more...)

2504.1627

Country:

North America > United States > California > San Francisco County > San Francisco (0.13)
North America > United States > California > Los Angeles County > Los Angeles (0.13)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(19 more...)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (0.46)
Education > Educational Setting (0.45)
Media > Music (0.45)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
(2 more...)

Hoang, Van Thuy, Jeon, Hyeon-Ju, Lee, O-Joun

Mitigating Degree Bias in Graph Representation Learning with Learnable Structural Augmentation and Structural Self-Attention

arXiv.org Artificial IntelligenceApr-22-2025

Graph Neural Networks (GNNs) update node representations through message passing, which is primarily based on the homophily principle, assuming that adjacent nodes share similar features. However, in real-world graphs with long-tailed degree distributions, high-degree nodes dominate message passing, causing a degree bias where low-degree nodes remain under-represented due to inadequate messages. The main challenge in addressing degree bias is how to discover non-adjacent nodes to provide additional messages to low-degree nodes while reducing excessive messages for high-degree nodes. Nevertheless, exploiting non-adjacent nodes to provide valuable messages is challenging, as it could generate noisy information and disrupt the original graph structures. To solve it, we propose a novel Degree Fairness Graph Transformer, named DegFairGT, to mitigate degree bias by discovering structural similarities between non-adjacent nodes through learnable structural augmentation and structural self-attention. Our key idea is to exploit non-adjacent nodes with similar roles in the same community to generate informative edges under our augmentation, which could provide informative messages between nodes with similar roles while ensuring that the homophily principle is maintained within the community. To enable DegFairGT to learn such structural similarities, we then propose a structural self-attention to capture the similarities between node pairs. To preserve global graph structures and prevent graph augmentation from hindering graph structure, we propose a Self-Supervised Learning task to preserve p-step transition probability and regularize graph augmentation. Extensive experiments on six datasets showed that DegFairGT outperformed state-of-the-art baselines in degree fairness analysis, node classification, and node clustering tasks.

artificial intelligence, machine learning, node, (16 more...)

2504.15075

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

arXiv.org Artificial IntelligenceApr-22-2025

GENE-FL: Gene-Driven Parameter-Efficient Dynamic Federated Learning

Guo, Shunxin, Lv, Jiaqi, Wang, Qiufeng, Geng, Xin

Real-world \underline{F}ederated \underline{L}earning systems often encounter \underline{D}ynamic clients with \underline{A}gnostic and highly heterogeneous data distributions (DAFL), which pose challenges for efficient communication and model initialization. To address these challenges, we draw inspiration from the recently proposed Learngene paradigm, which compresses the large-scale model into lightweight, cross-task meta-information fragments. Learngene effectively encapsulates and communicates core knowledge, making it particularly well-suited for DAFL, where dynamic client participation requires communication efficiency and rapid adaptation to new data distributions. Based on this insight, we propose a Gene-driven parameter-efficient dynamic Federated Learning (GENE-FL) framework. First, local models perform quadratic constraints based on parameters with high Fisher values in the global model, as these parameters are considered to encapsulate generalizable knowledge. Second, we apply the strategy of parameter sensitivity analysis in local model parameters to condense the \textit{learnGene} for interaction. Finally, the server aggregates these small-scale trained \textit{learnGene}s into a robust \textit{learnGene} with cross-task generalization capability, facilitating the rapid initialization of dynamic agnostic client models. Extensive experimental results demonstrate that GENE-FL reduces \textbf{4 $\times$} communication costs compared to FEDAVG and effectively initializes agnostic client models with only about \textbf{9.04} MB.

artificial intelligence, learngene, machine learning, (16 more...)

2504.14628

Country: Europe > Spain (0.28)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

arXiv.org Artificial IntelligenceApr-22-2025

Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems

Reddy, Ooha Lakkadi

This thesis employs a hybrid CNN-Transformer architecture, alongside a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (0.635) than to the Bronze Age Proto-Cuneiform (0.102) or Proto-Elamite (0.078). Contrary to expectations, when measured through direct script-to-script embedding comparisons, the Indus script maps closer to Tibetan-Yi Corridor scripts with a mean cosine similarity of 0.930 (CI: [0.917, 0.942]) than to contemporaneous West Asian signaries, which recorded mean similarities of 0.887 (CI: [0.863, 0.911]) and 0.855 (CI: [0.818, 0.891]). Across dimensionality reduction and clustering methods, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. These computational findings align with observed pictorial parallels in numeral systems, gender markers, and iconographic elements. Archaeological evidence of contact networks along the ancient Shu-Shendu road, coinciding with the Indus Civilization's decline, provides a plausible transmission pathway. While alternate explanations cannot be ruled out, the specificity and consistency of similarities suggest more complex cultural transmission networks between South and East Asia than previously recognized.

data mining, machine learning, natural language, (15 more...)

2503.21074

Country:

Europe (1.00)
Asia > India (1.00)
Asia > China (1.00)
North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine (0.92)
Government (0.92)
Information Technology (0.67)
Education > Educational Setting (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Noroozi, Amin, Esha, Sidratul Muntaha, Ghari, Mansoureh

Predictors of Childhood Vaccination Uptake in England: An Explainable Machine Learning Analysis of Longitudinal Regional Data (2021-2024)

arXiv.org Artificial IntelligenceApr-21-2025

Childhood vaccination is a cornerstone of public health, yet disparities in vaccination coverage persist across England. These disparities are shaped by complex interactions among various factors, including geographic, demographic, socioeconomic, and cultural (GDSC) factors. Previous studies mostly rely on cross-sectional data and traditional statistical approaches that assess individual or limited sets of variables in isolation. Such methods may fall short in capturing the dynamic and multivariate nature of vaccine uptake. In this paper, we conducted a longitudinal machine learning analysis of childhood vaccination coverage across 150 districts in England from 2021 to 2024. Using vaccination data from NHS records, we applied hierarchical clustering to group districts by vaccination coverage into low- and high-coverage clusters. A CatBoost classifier was then trained to predict districts' vaccination clusters using their GDSC data. Finally, the SHapley Additive exPlanations (SHAP) method was used to interpret the predictors' importance. The classifier achieved high accuracies of 92.1, 90.6, and 86.3 in predicting districts' vaccination clusters for the years 2021-2022, 2022-2023, and 2023-2024, respectively. SHAP revealed that geographic, cultural, and demographic variables, particularly rurality, English language proficiency, the percentage of foreign-born residents, and ethnic composition, were the most influential predictors of vaccination coverage, whereas socioeconomic variables, such as deprivation and employment, consistently showed lower importance, especially in 2023-2024. Surprisingly, rural districts were significantly more likely to have higher vaccination rates. Additionally, districts with lower vaccination coverage had higher populations whose first language was not English, who were born outside the UK, or who were from ethnic minority groups.

artificial intelligence, coverage cluster, machine learning, (16 more...)

2504.13755

Country:

Europe > United Kingdom > England > Tyne and Wear (0.49)
Europe > United Kingdom > England > Greater London > London (0.49)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Vaccines (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Government > Regional Government > Europe Government > United Kingdom Government (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Rusnak, Alexander, Kaplan, Frédéric

HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering

arXiv.org Artificial IntelligenceApr-21-2025

Traditional 3D scene understanding techniques are generally predicated on hand-annotated label sets, but in recent years a new class of open-vocabulary 3D scene understanding techniques has emerged. Despite the success of this paradigm on small scenes, existing approaches cannot scale efficiently to city-scale 3D datasets. In this paper, we present Hierarchical vocab-Agnostic Expert Clustering (HAEC), after the latin word for 'these', a superpoint graph clustering based approach which utilizes a novel mixture of experts graph transformer for its backbone. We administer this highly scalable approach to the first application of open-vocabulary scene understanding on the SensatUrban city-scale dataset. We also demonstrate a synthetic labeling pipeline which is derived entirely from the raw point clouds with no hand-annotation. Our technique can help unlock complex operations on dense urban 3D scenes and open a new path forward in the processing of digital twins.

artificial intelligence, machine learning, segmentation, (17 more...)

2504.1359

Country: Europe > Switzerland (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Alahyari, Shayan, Domaratzki, Mike

Local distribution-based adaptive oversampling for imbalanced regression

arXiv.org Machine LearningApr-19-2025

Imbalanced regression occurs when continuous target variables have skewed distributions, creating sparse regions that are difficult for machine learning models to predict accurately. This issue particularly affects neural networks, which often struggle with imbalanced data. While class imbalance in classification has been extensively studied, imbalanced regression remains relatively unexplored, with few effective solutions. Existing approaches often rely on arbitrary thresholds to categorize samples as rare or frequent, ignoring the continuous nature of target distributions. These methods can produce synthetic samples that fail to improve model performance and may discard valuable information through undersampling. To address these limitations, we propose LDAO (Local Distribution-based Adaptive Oversampling), a novel data-level approach that avoids categorizing individual samples as rare or frequent. Instead, LDAO learns the global distribution structure by decomposing the dataset into a mixture of local distributions, each preserving its statistical characteristics. LDAO then models and samples from each local distribution independently before merging them into a balanced training set. LDAO achieves a balanced representation across the entire target range while preserving the inherent statistical structure within each local distribution. In extensive evaluations on 45 imbalanced datasets, LDAO outperforms state-of-the-art oversampling methods on both frequent and rare target values, demonstrating its effectiveness for addressing the challenge of imbalanced regression.

artificial intelligence, machine learning, regression, (15 more...)

2504.14316

Country:

North America > United States > New Jersey > Hudson County > Hoboken (0.04)
North America > United States > California (0.04)
North America > Canada > Ontario > Middlesex County > London (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

You, Kisung, Shung, Dennis, Giuffrè, Mauro

Learning over von Mises-Fisher Distributions via a Wasserstein-like Geometry

arXiv.org Machine LearningApr-18-2025

We introduce a novel, geometry-aware distance metric for the family of von Mises-Fisher (vMF) distributions, which are fundamental models for directional data on the unit hypersphere. Although the vMF distribution is widely employed in a variety of probabilistic learning tasks involving spherical data, principled tools for comparing vMF distributions remain limited, primarily due to the intractability of normalization constants and the absence of suitable geometric metrics. Motivated by the theory of optimal transport, we propose a Wasserstein-like distance that decomposes the discrepancy between two vMF distributions into two interpretable components: a geodesic term capturing the angular separation between mean directions, and a variance-like term quantifying differences in concentration parameters. The derivation leverages a Gaussian approximation in the high-concentration regime to yield a tractable, closed-form expression that respects the intrinsic spherical geometry. We show that the proposed distance exhibits desirable theoretical properties and induces a latent geometric structure on the space of non-degenerate vMF distributions. As a primary application, we develop the efficient algorithms for vMF mixture reduction, enabling structure-preserving compression of mixture models in high-dimensional settings. Empirical results on synthetic datasets and real-world high-dimensional embeddings, including biomedical sentence representations and deep visual features, demonstrate the effectiveness of the proposed geometry in distinguishing distributions and supporting interpretable inference. This work expands the statistical toolbox for directional data analysis by introducing a tractable, transport-inspired distance tailored to the geometry of the hypersphere.

data mining, machine learning, natural language, (19 more...)

2504.14164

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Ontario > Toronto (0.14)
(7 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Anton, Cristina, Shreshtth, Roy Shivam Ram

Cluster weighted models with multivariate skewed distributions for functional data

arXiv.org Machine LearningApr-17-2025

Cluster weighted models with multivariate skewed distributions for functional data Cristina Anton, 1 Roy Shivam Ram Shreshtth 2 1 Department of Mathematics and Statistics, MacEwan University, 103C, 10700-104 Ave., Edmonton, AB T5J 4S2, Canada, email: popescuc@macewan.ca 2 Department of Mathematics and Statistics, Indian Institute of Technology Kanpur Abstract We propose a clustering method, funWeightClustSkew, based on mixtures of functional linear regression models and three skewed multivariate distributions: the variance-gamma distribution, the skew-t distribution, and the normal-inverse Gaussian distribution. Our approach follows the framework of the functional high dimensional data clustering (funHDDC) method, and we extend to functional data the cluster weighted models based on skewed distributions used for finite dimensional multivariate data. We consider several parsimonious models, and to estimate the parameters we construct an expectation maximization (EM) algorithm. We illustrate the performance of funWeightClustSkew for simulated data and for the Air Quality dataset. Keywords: Cluster weighted models, Functional linear regression, EM algorithm, Skewed distributions, Multivariate functional principal component analysis 1 Introduction Smart devices and other modern technologies record huge amounts of data measured continuously in time. These data are better represented as curves instead of finite-dimensional vectors, and they are analyzed using statistical methods specific to functional data (Ramsay and Silverman, 2006; Ferraty and Vieu, 2006; Horv ath and Kokoszka, 2012). Many times more than one curve is collected for one individual, e.g.

artificial intelligence, kw 1 2, machine learning, (18 more...)

2504.12683

Country:

North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.24)
North America > United States > New York (0.04)
Europe > Italy (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)