Noise feature


When Features Beat Noise: A Feature Selection Technique Through Noise-Based Hypothesis Testing

Sinha, Mousam, Ghosh, Tirtha Sarathi, Pal, Ridam

arXiv.org Machine Learning

Feature selection has remained a daunting challenge in machine learning and artificial intelligence, where increasingly complex, high-dimensional datasets demand principled strategies for isolating the most informative predictors. Despite widespread adoption, many established techniques suffer from notable limitations; some incur substantial computational cost, while others offer no definite, statistically driven stopping criterion and do not assess the significance of their importance scores. A common heuristic approach introduces multiple random noise features and retains all predictors ranked above the strongest noise feature. Although intuitive, this strategy lacks theoretical justification and depends heavily on heuristics. This paper proposes a novel feature selection method that addresses these limitations. Our approach introduces multiple random noise features and evaluates each feature's importance against the maximum importance value among these noise features, incorporating a non-parametric, bootstrap-based hypothesis-testing framework to establish a solid theoretical foundation. We establish the conceptual soundness of our approach through statistical derivations that articulate the principles guiding the design of our algorithm. To evaluate its reliability, we generated simulated datasets under controlled statistical settings and benchmarked performance against Boruta and Knockoff-based methods, observing consistently stronger recovery of meaningful signal. As a demonstration of practical utility, we applied the technique across diverse real-world datasets, where it surpassed feature selection techniques including Boruta, RFE, and Extra Trees. Hence, the method emerges as a robust algorithm for principled feature selection, enabling the distillation of informative predictors that support reliable inference, enhanced predictive performance, and efficient computation.
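
The screening idea is easy to prototype. The sketch below is a minimal illustration under assumed details (Gaussian noise columns, random-forest impurity importance, and a crude bootstrap p-value), not the authors' exact test:

    # Minimal sketch of noise-based feature screening: append random noise
    # columns, refit on bootstrap resamples, and keep features whose importance
    # reliably exceeds the maximum noise importance.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    def noise_screen(X, y, n_noise=10, n_boot=30, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        losses = np.zeros(p)   # how often each real feature loses to the best noise feature
        for b in range(n_boot):
            idx = rng.integers(0, n, n)                    # bootstrap resample
            noise = rng.standard_normal((n, n_noise))      # pure-noise features
            Xb = np.hstack([X[idx], noise])
            rf = RandomForestClassifier(n_estimators=200, random_state=b).fit(Xb, y[idx])
            imp = rf.feature_importances_
            losses += imp[:p] <= imp[p:].max()             # losing counts as evidence of noise
        pvals = losses / n_boot    # crude p-value: P(importance <= max noise importance)
        return np.where(pvals < alpha)[0]

    X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=1)
    print(noise_screen(X, y))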


Shapley-Inspired Feature Weighting in $k$-means with No Additional Hyperparameters

Fawley, Richard J., de Amorim, Renato Cordeiro

arXiv.org Artificial Intelligence

Clustering algorithms often assume all features contribute equally to the data structure, an assumption that usually fails in high-dimensional or noisy settings. Feature weighting methods can address this, but most require additional parameter tuning. We propose SHARK (Shapley Reweighted $k$-means), a feature-weighted clustering algorithm motivated by the use of Shapley values from cooperative game theory to quantify feature relevance, which requires no additional parameters beyond those in $k$-means. We prove that the $k$-means objective can be decomposed into a sum of per-feature Shapley values, providing an axiomatic foundation for unsupervised feature relevance and reducing Shapley computation from exponential to polynomial time. SHARK iteratively re-weights features by the inverse of their Shapley contribution, emphasising informative dimensions and down-weighting irrelevant ones. Experiments on synthetic and real-world data sets show that SHARK consistently matches or outperforms existing methods, achieving superior robustness and accuracy, particularly in scenarios where noise may be present. Software: https://github.com/rickfawley/shark.
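
Because the (weighted) k-means objective is a sum of per-feature within-cluster dispersions, each feature's Shapley value reduces to its marginal dispersion share. The loop below illustrates inverse-dispersion reweighting in that spirit; it is an assumed simplification, not the published SHARK update:

    # Illustrative inverse-dispersion feature reweighting for k-means
    # (a stand-in for SHARK's Shapley-based update, not the paper's exact rule).
    import numpy as np
    from sklearn.cluster import KMeans

    def reweighted_kmeans(X, k, n_iter=10, eps=1e-12, seed=0):
        p = X.shape[1]
        w = np.full(p, 1.0 / p)
        for _ in range(n_iter):
            # Scaling columns by sqrt(w) makes squared distances weighted by w.
            labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X * np.sqrt(w))
            # Per-feature within-cluster dispersion: the objective is additive
            # over features, so this is each feature's marginal share.
            disp = np.array([
                sum(((X[labels == c, v] - X[labels == c, v].mean()) ** 2).sum()
                    for c in range(k))
                for v in range(p)
            ])
            w = 1.0 / (disp + eps)
            w /= w.sum()            # low-dispersion (informative) features gain weight
        return labels, w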


Decorrelated feature importance from local sample weighting

Fröhlich, Benedikt, Durst, Alison, Behr, Merle

arXiv.org Machine Learning

Feature importance (FI) statistics provide a prominent and valuable method of insight into the decision process of machine learning (ML) models, but their effectiveness has well-known limitations when correlation is present among the features in the training data. In this case, the FI often tends to be distributed among all features which are in correlation with the response-generating signal features. Even worse, if multiple signal features are in strong correlation with a noise feature, while being only modestly correlated with one another, this can result in a noise feature having a distinctly larger FI score than any signal feature. Here we propose local sample weighting (losaw) which can flexibly be integrated into many ML algorithms to improve FI scores in the presence of feature correlation in the training data. Our approach is motivated by inverse probability weighting in causal inference and locally, within the ML model, uses a sample weighting scheme to decorrelate a target feature from the remaining features. This reduces model bias locally, whenever the effect of a potential signal feature is evaluated and compared to others. Moreover, losaw comes with a natural tuning parameter, the minimum effective sample size of the weighted population, which corresponds to an interpretation-prediction tradeoff, analogous to the bias-variance tradeoff of classical ML tuning parameters. We demonstrate how losaw can be integrated within decision tree-based ML methods and within mini-batch training of neural networks. We investigate losaw for random forests and convolutional neural networks in a simulation study on settings showing diverse correlation patterns. We found that losaw improves FI consistently. Moreover, it often improves prediction accuracy on out-of-distribution test data, while maintaining similar accuracy on in-distribution test data.
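
A minimal illustration of the underlying inverse-probability-weighting step, with an assumed binarisation of the target feature and a logistic propensity model (the paper's actual scheme is more general):

    # Sketch: sample weights that decorrelate one feature from the rest,
    # in the spirit of losaw's IPW motivation.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ipw_weights(X, j):
        """Weights making (a binarised copy of) feature j roughly
        independent of the remaining features in the weighted sample."""
        t = (X[:, j] > np.median(X[:, j])).astype(int)   # assumed crude binarisation
        Z = np.delete(X, j, axis=1)
        prop = LogisticRegression(max_iter=1000).fit(Z, t).predict_proba(Z)[:, 1]
        prop = np.clip(prop, 0.05, 0.95)                 # stabilise extreme propensities
        w = np.where(t == 1, 1.0 / prop, 1.0 / (1.0 - prop))
        n_eff = w.sum() ** 2 / (w ** 2).sum()            # effective sample size, the natural tuning quantity
        return w / w.mean(), n_eff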


Consistency of Feature Attribution in Deep Learning Architectures for Multi-Omics

Claborne, Daniel, Flores, Javier, Erwin, Samantha, Durell, Luke, Richardson, Rachel, Fore, Ruby, Bramer, Lisa

arXiv.org Machine Learning

Machine and deep learning have grown in popularity and use in biological research over the last decade but still present challenges in interpretability of the fitted model. The development and use of metrics to determine features driving predictions and increase model interpretability continues to be an open area of research. We investigate the use of Shapley Additive Explanations (SHAP) on a multi-view deep learning model applied to multi-omics data for the purposes of identifying biomolecules of interest. Rankings of features via these attribution methods are compared across various architectures to evaluate consistency of the method. We perform multiple computational experiments to assess the robustness of SHAP and investigate modeling approaches and diagnostics to increase and measure the reliability of the identification of important features. Accuracy of a random-forest model fit on subsets of features selected as being most influential, as well as clustering quality using only these features, are used as measures of effectiveness of the attribution method. Our findings indicate that the rankings of features resulting from SHAP are sensitive to the choice of architecture as well as different random initializations of weights, suggesting caution when using attribution methods on multi-view deep learning models applied to multi-omics data. We present an alternative, simple method to assess the robustness of identification of important biomolecules.
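
One of the stability checks described here can be reproduced in miniature: compare mean |SHAP| feature rankings across two random seeds via rank correlation. The sketch assumes the shap package and substitutes a tree regressor for the paper's multi-view deep networks:

    # Sketch: sensitivity of SHAP feature rankings to random initialisation.
    import numpy as np
    import shap
    from scipy.stats import spearmanr
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=300, n_features=15, random_state=0)

    def shap_ranking(seed):
        model = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
        sv = shap.TreeExplainer(model).shap_values(X)    # (n_samples, n_features)
        return np.abs(sv).mean(axis=0)                   # mean |SHAP| per feature

    rho, _ = spearmanr(shap_ranking(0), shap_ranking(1))
    print(f"rank correlation across seeds: {rho:.2f}")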


Scalable unsupervised feature selection via weight stability

Zhang, Xudong, de Amorim, Renato Cordeiro

arXiv.org Artificial Intelligence

Unsupervised feature selection is critical for improving clustering performance in high-dimensional data, where irrelevant features can obscure meaningful structure. In this work, we introduce the Minkowski weighted $k$-means++, a novel initialisation strategy for the Minkowski weighted $k$-means. Our initialisation selects centroids probabilistically using feature relevance estimates derived from the data itself. Building on this, we propose two new feature selection algorithms, FS-MWK++, which aggregates feature weights across a range of Minkowski exponents to identify stable and informative features, and SFS-MWK++, a scalable variant based on subsampling. We support our approach with a theoretical guarantee under mild assumptions and extensive experiments showing that our methods consistently outperform existing alternatives. Our software can be found at https://github.com/xzhang4-ops1/FSMWK.
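
The aggregation idea can be sketched with the standard Minkowski-weighted $k$-means weight update, w_v = 1 / sum_u (D_v / D_u)^(1/(p-1)), averaged over several exponents p. This is an assumed simplification: it reuses plain $k$-means labels rather than the paper's MWK++ initialisation:

    # Sketch: aggregate Minkowski feature weights over several exponents.
    import numpy as np
    from sklearn.cluster import KMeans

    def mwk_weights(X, labels, p, eps=1e-12):
        """Standard MWK weight update at Minkowski exponent p."""
        D = np.full(X.shape[1], eps)
        for c in range(labels.max() + 1):
            Xc = X[labels == c]
            D += (np.abs(Xc - Xc.mean(axis=0)) ** p).sum(axis=0)   # per-feature dispersion
        ratio = (D[:, None] / D[None, :]) ** (1.0 / (p - 1))
        return 1.0 / ratio.sum(axis=1)

    def aggregated_weights(X, k, exponents=(1.5, 2.0, 2.5, 3.0), seed=0):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        # Features with consistently high weight across exponents are kept.
        return np.mean([mwk_weights(X, labels, p) for p in exponents], axis=0)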


Improving internal cluster quality evaluation in noisy Gaussian mixtures

de Amorim, Renato Cordeiro, Makarenkov, Vladimir

arXiv.org Machine Learning

Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be affected by the feature relevance issue, potentially leading to unreliable evaluations in high-dimensional or noisy data sets. We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features. The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable.
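
A simplified stand-in for the rescaling idea: weight each feature by its inverse within-cluster dispersion before computing a validity index, so noise features contribute less. The exact FIR rule in the paper may differ:

    # Sketch: dispersion-based feature rescaling before a validity index.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def rescaled_silhouette(X, k, eps=1e-12, seed=0):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        disp = np.full(X.shape[1], eps)
        for c in range(k):
            Xc = X[labels == c]
            disp += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)   # within-cluster spread
        w = 1.0 / disp
        w /= w.sum()             # compact (informative) features get larger weight
        return silhouette_score(X * np.sqrt(w), labels)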


Image Forgery Localization via Guided Noise and Multi-Scale Feature Aggregation

Niu, Yakun, Chen, Pei, Zhang, Lei, Tan, Lei, Chen, Yingjian

arXiv.org Artificial Intelligence

Image Forgery Localization (IFL) technology aims to detect and locate the forged areas in an image, which is very important in the field of digital forensics. However, existing IFL methods suffer from feature degradation during training using multi-layer convolutions or the self-attention mechanism, and perform poorly in detecting small forged regions and in robustness against post-processing. To tackle these issues, we propose a guided and multi-scale feature aggregated network for IFL. Specifically, in order to comprehensively learn the noise features under different types of forgery, we develop an effective noise extraction module in a guided way. Then, we design a Feature Aggregation Module (FAM) that uses dynamic convolution to adaptively aggregate RGB and noise features over multiple scales. Moreover, we propose an Atrous Residual Pyramid Module (ARPM) to enhance feature representation and capture both global and local features using different receptive fields to improve the accuracy and robustness of forgery localization. Extensive experiments on 5 public datasets have shown that our proposed model outperforms several state-of-the-art methods, especially on images with small forged regions.
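
The ARPM idea, parallel dilated convolutions with different receptive fields fused residually, can be sketched in a few lines of PyTorch; the channel sizes and dilation rates here are illustrative assumptions, not the paper's configuration:

    # Sketch of an atrous residual pyramid block: parallel dilated 3x3
    # convolutions capture multiple receptive fields, fused with a 1x1
    # convolution and added back residually.
    import torch
    import torch.nn as nn

    class AtrousResidualPyramid(nn.Module):
        def __init__(self, channels, dilations=(1, 2, 4, 8)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
                for d in dilations
            ])
            self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

        def forward(self, x):
            feats = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
            return x + self.fuse(feats)      # residual connection

    x = torch.randn(1, 64, 32, 32)
    print(AtrousResidualPyramid(64)(x).shape)   # torch.Size([1, 64, 32, 32])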


Heterogeneous Random Forest

Kim, Ye-eun, Kim, Seoung Yun, Kim, Hyunjoong

arXiv.org Machine Learning

Random forest (RF) stands out as a highly favored machine learning approach for classification problems. The effectiveness of RF hinges on two key factors: the accuracy of individual trees and the diversity among them. In this study, we introduce a novel approach called heterogeneous RF (HRF), designed to enhance tree diversity in a meaningful way. This diversification is achieved by deliberately introducing heterogeneity during the tree construction. Specifically, features used for splitting near the root node of previous trees are assigned lower weights when constructing the feature sub-space of the subsequent trees. As a result, dominant features in the prior trees are less likely to be employed in the next iteration, leading to a more diverse set of splitting features at the nodes. Through simulation studies, it was confirmed that the HRF method effectively mitigates the selection bias of trees within the ensemble, increases the diversity of the ensemble, and demonstrates superior performance on datasets with fewer noise features. To assess the comparative performance of HRF against other widely adopted ensemble methods, we conducted tests on 52 datasets, comprising both real-world and synthetic data. HRF consistently outperformed other ensemble methods in terms of accuracy across the majority of datasets.
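
The weighting scheme can be prototyped with scikit-learn trees: after each tree is fit, features that split at the root or its children are down-weighted before sampling the next tree's feature subspace. The depth cutoff and decay factor below are illustrative assumptions, not the paper's settings:

    # Sketch of the HRF idea: penalise features that dominate near the root
    # of earlier trees so later trees draw more diverse subspaces.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def heterogeneous_forest(X, y, n_trees=50, decay=0.5, seed=0):
        rng = np.random.default_rng(seed)
        n, n_feat = X.shape
        mtry = max(1, int(np.sqrt(n_feat)))
        w = np.full(n_feat, 1.0 / n_feat)
        trees = []
        for t in range(n_trees):
            feats = rng.choice(n_feat, size=mtry, replace=False, p=w)
            idx = rng.integers(0, n, n)                    # bootstrap sample
            tree = DecisionTreeClassifier(random_state=t).fit(X[np.ix_(idx, feats)], y[idx])
            trees.append((tree, feats))
            tr = tree.tree_
            nodes = [0]                                    # root ...
            if tr.children_left[0] != -1:                  # ... and its children
                nodes += [tr.children_left[0], tr.children_right[0]]
            for node in nodes:
                f = tr.feature[node]
                if f >= 0:                                 # negative marks a leaf
                    w[feats[f]] *= decay                   # penalise dominant features
            w /= w.sum()
        return trees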


Boosting Robustness in Preference-Based Reinforcement Learning with Dynamic Sparsity

Muslimani, Calarina, Grooten, Bram, Mamillapalli, Deepak Ranganatha Sastry, Pechenizkiy, Mykola, Mocanu, Decebal Constantin, Taylor, Matthew E.

arXiv.org Artificial Intelligence

For autonomous agents to successfully integrate into human-centered environments, agents should be able to learn from and adapt to humans in their native settings. Preference-based reinforcement learning (PbRL) is a promising approach that learns reward functions from human preferences. This enables RL agents to adapt their behavior based on human desires. However, humans live in a world full of diverse information, most of which is not relevant to completing a particular task. It becomes essential that agents learn to focus on the subset of task-relevant environment features. Unfortunately, prior work has largely ignored this aspect, focusing primarily on improving PbRL algorithms in standard RL environments that are carefully constructed to contain only task-relevant features. This can result in algorithms that may not effectively transfer to noisier real-world settings. To that end, this work proposes R2N (Robust-to-Noise), the first PbRL algorithm that leverages principles of dynamic sparse training to learn robust reward models that can focus on task-relevant features. We study the effectiveness of R2N in the Extremely Noisy Environment setting, an RL problem setting where up to 95% of the state features are irrelevant distractions. In experiments with a simulated teacher, we demonstrate that R2N can adapt the sparse connectivity of its neural networks to focus on task-relevant features, enabling R2N to significantly outperform several state-of-the-art PbRL algorithms in multiple locomotion and control environments.
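
For concreteness, the sketch below shows a generic SET-style prune-and-regrow step of the kind dynamic sparse training builds on; it is an illustration of the general mechanism, not R2N's actual topology-update rule:

    # Sketch: one dynamic-sparse-training step on a masked weight tensor.
    # Drop the weakest active weights and regrow the same number at random
    # inactive positions.
    import torch

    def prune_and_regrow(weight, mask, frac=0.3):
        """weight, mask: same-shape tensors; mask is {0,1}. Returns new mask."""
        active = mask.bool()
        k = int(frac * active.sum().item())
        if k == 0:
            return mask
        mag = weight.abs().masked_fill(~active, float("inf"))
        drop = torch.topk(mag.flatten(), k, largest=False).indices   # weakest active
        new_mask = mask.clone().flatten()
        new_mask[drop] = 0.0
        inactive = (new_mask == 0).nonzero().flatten()
        grow = inactive[torch.randperm(len(inactive))[:k]]           # random regrowth
        new_mask[grow] = 1.0
        return new_mask.view_as(mask)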


LLpowershap: Logistic Loss-based Automated Shapley Values Feature Selection Method

Madakkatel, Iqbal, Hyppönen, Elina

arXiv.org Artificial Intelligence

Shapley values have been used extensively in machine learning, not only to explain black-box machine learning models but also, among other tasks, to conduct model debugging, sensitivity and fairness analyses, and to select important features for robust modelling and further follow-up analyses. Shapley values satisfy certain axioms that promote fairness in distributing contributions of features toward prediction or reducing error, after accounting for non-linear relationships and interactions when complex machine learning models are employed. Recently, a number of feature selection methods utilising Shapley values have been introduced. Here, we present a novel feature selection method, LLpowershap, which makes use of loss-based Shapley values to identify informative features with minimal noise among the selected sets of features. Our simulation results show that LLpowershap not only identifies a higher number of informative features but also outputs fewer noise features compared to other state-of-the-art feature selection methods. Benchmarking results on four real-world datasets demonstrate higher or on-par predictive performance of LLpowershap compared to other Shapley-based wrapper methods and filter methods.
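
In the spirit of this family of methods, the sketch below injects a known noise column, scores features by mean |SHAP|, and keeps those that beat the noise column. It uses plain SHAP on a gradient-boosted model as an assumed simplification of LLpowershap's loss-based Shapley values:

    # Sketch: Shapley-based selection against an injected noise reference.
    import numpy as np
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=400, n_features=12, n_informative=4, random_state=0)
    rng = np.random.default_rng(0)
    Xn = np.hstack([X, rng.standard_normal((X.shape[0], 1))])   # append one noise column

    model = GradientBoostingClassifier(random_state=0).fit(Xn, y)
    sv = shap.TreeExplainer(model).shap_values(Xn)              # (n_samples, n_features + 1)
    score = np.abs(np.asarray(sv)).mean(axis=0)                 # mean |SHAP| per feature
    keep = np.where(score[:-1] > score[-1])[0]                  # beat the noise column
    print("selected features:", keep)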