AITopics | Data Science

Data Science

Discrimination-free Insurance Pricing with Privatized Sensitive Attributes

arXiv.org Machine Learning, Apr-16-2025

Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies on the collection and utilization of sensitive consumer attributes. These factors present additional challenges in the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models that are tailored to the insurance domain, using only privatized sensitive attributes. Notably, our approach ensures statistical guarantees, does not require direct access to sensitive attributes, and adapts to varying transparency requirements, addressing regulatory demands while ensuring fairness in insurance pricing.

artificial intelligence, machine learning, test loss, (17 more...)

arXiv.org Machine Learning

2504.11775

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.68)

Industry: Banking & Finance > Insurance (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Bayesian Density-Density Regression with Application to Cell-Cell Communications

Nguyen, Khai, Ni, Yang, Mueller, Peter

arXiv.org Machine Learning, Apr-16-2025

We introduce a scalable framework for regressing multivariate distributions onto multivariate distributions, motivated by the application of inferring cell-cell communication from population-scale single-cell data. The observed data consist of pairs of multivariate distributions for ligands from one cell type and corresponding receptors from another. For each ordered pair $e=(l,r)$ of cell types $(l \neq r)$ and each sample $i = 1, \ldots, n$, we observe a pair of distributions $(F_{ei}, G_{ei})$ of gene expressions for ligands and receptors of cell types $l$ and $r$, respectively. The aim is to set up a regression of receptor distributions $G_{ei}$ given ligand distributions $F_{ei}$. A key challenge is that these distributions reside in distinct spaces of differing dimensions. We formulate the regression of multivariate densities on multivariate densities using a generalized Bayes framework with the sliced Wasserstein distance between fitted and observed distributions. Finally, we use inference under such regressions to define a directed graph for cell-cell communications.
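The sliced Wasserstein distance mentioned above can be illustrated with a minimal Monte Carlo sketch. It assumes both distributions are represented by equal-sized samples in the same space (as for fitted and observed receptor distributions); the function name, the number of projections, and the use of NumPy are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance between two
    equal-sized samples x and y, each of shape (n, d). Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal-sized samples in the same space")
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions uniformly on the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto every direction. In one dimension the optimal
    # transport plan between equal-sized samples pairs the sorted values.
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    # Average the p-th power cost over points and directions, then take the root.
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
sample = rng.normal(size=(500, 3))
shifted = sample + np.array([3.0, 0.0, 0.0])
print(sliced_wasserstein(sample, sample))   # identical samples give 0.0
print(sliced_wasserstein(sample, shifted))  # grows with the size of the shift
```

Sorting makes each one-dimensional transport problem cheap, which is the usual reason for preferring the sliced distance over the full multivariate Wasserstein distance.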

artificial intelligence, bayesian ddr, machine learning, (18 more...)

arXiv.org Machine Learning

2504.12617

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.94)
Information Technology > Data Science (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

Predictive Multiplicity in Survival Models: A Method for Quantifying Model Uncertainty in Predictive Maintenance Applications

Cavus, Mustafa

arXiv.org Machine Learning, Apr-16-2025

In many applications, especially those involving prediction, models may yield near-optimal performance yet significantly disagree on individual-level outcomes. This phenomenon, known as predictive multiplicity, has been formally defined in binary, probabilistic, and multi-target classification, and undermines the reliability of predictive systems. However, its implications remain unexplored in the context of survival analysis, which involves estimating the time until a failure or similar event while properly handling censored data. We frame predictive multiplicity as a critical concern in survival-based models and introduce formal measures -- ambiguity, discrepancy, and obscurity -- to quantify it. This is particularly relevant for downstream tasks such as maintenance scheduling, where precise individual risk estimates are essential. Understanding and reporting predictive multiplicity helps build trust in models deployed in high-stakes environments. We apply our methodology to benchmark datasets from predictive maintenance, extending the notion of multiplicity to survival models. Our findings show that ambiguity steadily increases, reaching up to 40-45% of observations; discrepancy is lower but exhibits a similar trend; and obscurity remains mild and concentrated in a few models. These results demonstrate that multiple accurate survival models may yield conflicting estimations of failure risk and degradation progression for the same equipment. This highlights the need to explicitly measure and communicate predictive multiplicity to ensure reliable decision-making in process health management.
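To make the idea of multiplicity measures concrete, here is a minimal sketch of ambiguity and discrepancy as they are commonly defined for binary classification. Applying them to survival models would require first turning risk estimates into decisions (for example by thresholding), and the paper's exact survival-specific definitions, including obscurity, are not reproduced here. All names and the toy predictions are illustrative assumptions.

```python
import numpy as np

def ambiguity(baseline_pred, competing_preds):
    """Fraction of observations whose prediction is changed by at least one
    competing near-optimal model (classification-style definition)."""
    baseline = np.asarray(baseline_pred)
    competing = np.asarray(competing_preds)  # shape (n_models, n_observations)
    return float(np.mean(np.any(competing != baseline, axis=0)))

def discrepancy(baseline_pred, competing_preds):
    """Largest fraction of observations that any single competing model changes."""
    baseline = np.asarray(baseline_pred)
    competing = np.asarray(competing_preds)
    return float(np.max(np.mean(competing != baseline, axis=1)))

# Toy decisions (1 = schedule maintenance) from a baseline model and three
# hypothetical competitors with similar overall accuracy.
baseline = [0, 1, 1, 0, 1]
competitors = [
    [0, 1, 0, 0, 1],  # disagrees on observation 2
    [1, 1, 1, 0, 1],  # disagrees on observation 0
    [0, 1, 1, 0, 1],  # agrees everywhere
]
print(ambiguity(baseline, competitors))    # 0.4
print(discrepancy(baseline, competitors))  # 0.2
```

Ambiguity counts every observation that some competitor flips, so it is never smaller than discrepancy, which looks at one competitor at a time.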

data mining, machine learning, prediction, (20 more...)

arXiv.org Machine Learning

2504.12156

Genre: Research Report > New Finding (1.00)

Industry:

Law > Civil Rights & Constitutional Law (0.35)
Health & Medicine > Consumer Health (0.35)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An Incremental Non-Linear Manifold Approximation Method

Hettige, Praveen T. W., Ong, Benjamin W.

arXiv.org Machine Learning, Apr-11-2025

Analyzing high-dimensional data presents challenges due to the "curse of dimensionality", making computations intensive. Dimension reduction techniques, categorized as linear or non-linear, simplify such data. Non-linear methods are particularly essential for efficiently visualizing and processing complex data structures in interactive and graphical applications. This research develops an incremental non-linear dimension reduction method using the Geometric Multi-Resolution Analysis (GMRA) framework for streaming data. The proposed method enables real-time data analysis and visualization by incrementally updating the cluster map, PCA basis vectors, and wavelet coefficients. Numerical experiments show that the incremental GMRA accurately represents non-linear manifolds even with small initial samples and aligns closely with batch GMRA, demonstrating efficient updates and maintaining the multiscale structure. The findings highlight the potential of Incremental GMRA for real-time visualization and interactive graphics applications that require adaptive high-dimensional data representations.

approximation, data mining, machine learning, (10 more...)

arXiv.org Machine Learning

2504.09068

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning in High Dimensional Spaces (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Riemannian Optimization on Relaxed Indicator Matrix Manifold

Yuan, Jinghui, Xie, Fangyuan, Nie, Feiping, Li, Xuelong

arXiv.org Machine Learning, Apr-11-2025

The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix and prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide several methods of Retraction, including a fast Retraction method to obtain geodesics. We point out that the RIM manifold is a generalization of the double stochastic manifold, and that optimization on it is much faster: existing methods on the double stochastic manifold have a complexity of $\mathcal{O}(n^3)$, while RIM manifold optimization is $\mathcal{O}(n)$ and often yields better results. We conducted extensive experiments, including image denoising, with millions of variables to support our conclusion. We also applied the RIM manifold to Ratio Cut, for which we provide a rigorous convergence proof and achieve clustering results that outperform the state-of-the-art methods. Our code is available at https://github.com/Yuan-Jinghui/Riemannian-Optimization-on-Relaxed-Indicator-Matrix-Manifold.

data mining, machine learning, manifold, (18 more...)

arXiv.org Machine Learning

2503.20505

Country: North America > United States > California (0.27)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(3 more...)

Your data's probably not ready for AI - here's how to make it trustworthy

ZDNet, Apr-10-2025, 17:40:39 GMT

Trust is fragile, and that's one problem with artificial intelligence, which is only as good as the data behind it. Data integrity concerns -- which have vexed even the savviest organizations for decades -- are rearing their heads again. And industry experts are sounding the alarm. Users of generative AI may be fed incomplete, duplicative, or erroneous information that comes back to bite them -- thanks to the weak or siloed data underpinning these systems. "AI and gen AI are raising the bar for quality data," according to a recent analysis published by Ashish Verma, chief data and analytics officer at Deloitte US, and a team of co-authors.

artificial intelligence, data mining, data quality, (10 more...)

ZDNet

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Data Science > Data Quality (0.58)
Information Technology > Data Science > Data Mining (0.54)

Local Distance-Preserving Node Embeddings and Their Performance on Random Graphs

Le, My, Ruiz, Luana, Dhara, Souvik

arXiv.org Machine Learning, Apr-10-2025

Learning node representations is a fundamental problem in graph machine learning. While existing embedding methods effectively preserve local similarity measures, they often fail to capture global functions like graph distances. Inspired by Bourgain's seminal work on Hilbert space embeddings of metric spaces (1985), we study the performance of local distance-preserving node embeddings. Known as landmark-based algorithms, these embeddings approximate pairwise distances by computing shortest paths from a small subset of reference nodes (i.e., landmarks). Our main theoretical contribution shows that random graphs, such as Erdős-Rényi random graphs, require lower dimensions in landmark-based embeddings compared to worst-case graphs. Empirically, we demonstrate that the GNN-based approximations for the distances to landmarks generalize well to larger networks, offering a scalable alternative for graph representation learning.
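The landmark idea described above can be sketched in a few lines: each node is embedded as its vector of shortest-path distances to the landmarks, and a pairwise distance is estimated by routing through the best landmark. This is a minimal illustration on an unweighted graph using exact breadth-first search. It does not reproduce the paper's GNN-based approximation, and the function names and the toy path graph are assumptions.

```python
from collections import deque

def bfs_distances(adj, source):
    """Exact hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def landmark_embedding(adj, landmarks):
    """Embed each node as its list of distances to the chosen landmarks."""
    per_landmark = [bfs_distances(adj, landmark) for landmark in landmarks]
    inf = float("inf")
    return {u: [d.get(u, inf) for d in per_landmark] for u in adj}

def estimate_distance(embedding, u, v):
    """Triangle-inequality upper bound: the shortest route via any landmark."""
    return min(a + b for a, b in zip(embedding[u], embedding[v]))

# Path graph 0 - 1 - 2 - 3 - 4 with node 2 as the only landmark.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
emb = landmark_embedding(adj, landmarks=[2])
print(estimate_distance(emb, 0, 4))  # 4, exact because node 2 lies on the path
print(estimate_distance(emb, 0, 1))  # 3, an overestimate of the true distance 1
```

The second query shows why the number and placement of landmarks matters: the estimate is exact only when some landmark lies on a shortest path between the two nodes.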

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2504.08216

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Performance of Rank-One Tensor Approximation on Incomplete Data

Lebeau, Hugo

arXiv.org Machine Learning, Apr-10-2025

We are interested in the estimation of a rank-one tensor signal when only a portion $\varepsilon$ of its noisy observation is available. We show that the study of this problem can be reduced to that of a random matrix model whose spectral analysis gives access to the reconstruction performance. These results shed light on and specify the loss of performance induced by an artificial reduction of the memory cost of a tensor via the deletion of a random part of its entries.
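As a rough numerical illustration of this setting, the sketch below plants a rank-one signal in a noisy order-3 tensor, keeps each entry independently with probability $\varepsilon$, and estimates one signal direction from the top singular vector of a matrix unfolding. The unfolding estimator, the noise scaling, and all parameter values are assumptions chosen for illustration. They are not taken from the paper and do not reproduce its analysis.

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def masked_rank_one_alignment(n=20, beta=20.0, eps=0.5, seed=0):
    """Return |<x_hat, x>| for a spectral estimate of x from a masked noisy tensor."""
    rng = np.random.default_rng(seed)
    x, y, z = (unit(rng.normal(size=n)) for _ in range(3))
    # Rank-one signal of strength beta plus Gaussian noise (assumed 1/sqrt(n) scaling).
    signal = beta * np.einsum("i,j,k->ijk", x, y, z)
    noise = rng.normal(size=(n, n, n)) / np.sqrt(n)
    # Keep each entry independently with probability eps; deleted entries become 0.
    mask = rng.random((n, n, n)) < eps
    observed = mask * (signal + noise)
    # Unfold along the first mode and take the dominant left singular vector.
    unfolding = observed.reshape(n, n * n)
    left_vectors, _, _ = np.linalg.svd(unfolding, full_matrices=False)
    x_hat = left_vectors[:, 0]
    return float(abs(x_hat @ x))

print(masked_rank_one_alignment(eps=0.5))  # close to 1 when the signal is strong
print(masked_rank_one_alignment(eps=0.1))  # typically lower: fewer entries are kept
```

Deleting entries shrinks the effective signal strength roughly in proportion to $\varepsilon$, which is the memory-versus-accuracy trade-off the abstract refers to.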

artificial intelligence, machine learning, reconstruction performance, (18 more...)

arXiv.org Machine Learning

2504.07818

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.47)
Information Technology > Data Science (0.47)

Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

Hou, Xinyi, Zhao, Yanjie, Wang, Shenao, Wang, Haoyu

arXiv.org Artificial Intelligence, Apr-6-2025

The Model Context Protocol (MCP) is a standardized interface designed to enable seamless interaction between AI models and external tools and resources, breaking down data silos and facilitating interoperability across diverse systems. This paper provides a comprehensive overview of MCP, focusing on its core components, workflow, and the lifecycle of MCP servers, which consists of three key phases: creation, operation, and update. We analyze the security and privacy risks associated with each phase and propose strategies to mitigate potential threats. The paper also examines the current MCP landscape, including its adoption by industry leaders and various use cases, as well as the tools and platforms supporting its integration. We explore future directions for MCP, highlighting the challenges and opportunities that will influence its adoption and evolution within the broader AI ecosystem. Finally, we offer recommendations for MCP stakeholders to ensure its secure and sustainable development as the AI landscape continues to evolve.

data mining, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2503.23278

Country:

Asia (0.68)
North America > United States (0.28)

Genre: Overview (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(3 more...)

Cramer-Rao Bounds for Laplacian Matrix Estimation

Halihal, Morad, Routtenberg, Tirza, Poor, H. Vincent

arXiv.org Machine Learning, Apr-6-2025

In this paper, we analyze the performance of the estimation of Laplacian matrices under general observation models. Laplacian matrix estimation involves structural constraints, including symmetry and null-space properties, along with matrix sparsity. By exploiting a linear reparametrization that enforces the structural constraints, we derive closed-form matrix expressions for the Cramér-Rao Bound (CRB) specifically tailored to Laplacian matrix estimation. We further extend the derivation to the sparsity-constrained case, introducing two oracle CRBs that incorporate prior information of the support set, i.e., the locations of the nonzero entries in the Laplacian matrix. We examine the properties and order relations between the bounds, and provide the associated Slepian-Bangs formula for the Gaussian case. We demonstrate the use of the new CRBs in three representative applications: (i) topology identification in power systems, (ii) graph filter identification in diffused models, and (iii) precision matrix estimation in Gaussian Markov random fields under Laplacian constraints. The CRBs are evaluated and compared with the mean-squared-errors (MSEs) of the constrained maximum likelihood estimator (CMLE), which integrates both equality and inequality constraints along with sparsity constraints, and of the oracle CMLE, which knows the locations of the nonzero entries of the Laplacian matrix. We perform this analysis for the applications of power system topology identification and graphical LASSO, and demonstrate that the MSEs of the estimators converge to the CRB and oracle CRB, given a sufficient number of measurements.
Graph-structured data and signals arise in numerous applications, including power systems, communications, finance, social networks, and biological networks, for analysis and inference of networks [2], [3]. In this context, the Laplacian matrix, which captures node connectivity and edge weights, serves as a fundamental tool for clustering [4], modeling graph diffusion processes [5], [6], topology inference [6]-[12], anomaly detection [13], graph-based filtering [14]-[18], and analyzing smoothness on graphs [19]. M. Halihal and T. Routtenberg are with the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel, e-mail: moradha@post.bgu.ac.il, tirzar@bgu.ac.il.
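The structural constraints named above (symmetry and the all-ones null space) can be enforced by construction with a linear map from edge weights to the Laplacian. The sketch below uses one standard parametrization by the strictly lower-triangular edge weights. The paper's exact reparametrization is not given in this excerpt, so the function name and the ordering of the weights are assumptions.

```python
import numpy as np

def laplacian_from_edge_weights(weights, n):
    """Map n(n-1)/2 edge weights (strict lower triangle, row by row) to an
    n-by-n graph Laplacian. The map is linear in the weights."""
    rows, cols = np.tril_indices(n, k=-1)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != rows.shape:
        raise ValueError("expected n(n-1)/2 edge weights")
    L = np.zeros((n, n))
    # Off-diagonal entries are the negated edge weights, mirrored for symmetry.
    L[rows, cols] = -weights
    L[cols, rows] = -weights
    # Diagonal entries are the weighted degrees, so every row sums to zero.
    L[np.arange(n), np.arange(n)] = -L.sum(axis=1)
    return L

# Three nodes with edges (1,0) of weight 1 and (2,1) of weight 2; no edge (2,0).
L = laplacian_from_edge_weights([1.0, 0.0, 2.0], n=3)
print(L)
print(np.allclose(L, L.T))              # symmetry holds by construction
print(np.allclose(L @ np.ones(3), 0))   # the all-ones vector is in the null space
```

Because the constraints hold for any weight vector, an estimator can work in the unconstrained weight space, and sparsity then corresponds to zeros in that vector.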

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2504.04576

Country: Asia > Middle East > Israel > Southern District > Beer-Sheva (0.24)

Genre: Research Report (1.00)

Industry:

Information Technology (1.00)
Energy > Power Industry (0.93)

Technology:

Information Technology > Communications (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.54)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
