Goto

Collaborating Authors

 Data Science


Discrimination-free Insurance Pricing with Privatized Sensitive Attributes

arXiv.org Machine Learning

Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies on the collection and utilization of sensitive consumer attributes. These factors present additional challenges in the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models that are tailored to the insurance domain, using only privatized sensitive attributes. Notably, our approach ensures statistical guarantees, does not require direct access to sensitive attributes, and adapts to varying transparency requirements, addressing regulatory demands while ensuring fairness in insurance pricing.


Bayesian Density-Density Regression with Application to Cell-Cell Communications

arXiv.org Machine Learning

We introduce a scalable framework for regressing multivariate distributions onto multivariate distributions, motivated by the application of inferring cell-cell communication from population-scale single-cell data. The observed data consist of pairs of multivariate distributions for ligands from one cell type and corresponding receptors from another. For each ordered pair $e=(l,r)$ of cell types $(l \neq r)$ and each sample $i = 1, \ldots, n$, we observe a pair of distributions $(F_{ei}, G_{ei})$ of gene expressions for ligands and receptors of cell types $l$ and $r$, respectively. The aim is to set up a regression of receptor distributions $G_{ei}$ given ligand distributions $F_{ei}$. A key challenge is that these distributions reside in distinct spaces of differing dimensions. We formulate the regression of multivariate densities on multivariate densities using a generalized Bayes framework with the sliced Wasserstein distance between fitted and observed distributions. Finally, we use inference under such regressions to define a directed graph for cell-cell communications.


Predictive Multiplicity in Survival Models: A Method for Quantifying Model Uncertainty in Predictive Maintenance Applications

arXiv.org Machine Learning

In many applications, especially those involving prediction, models may yield near-optimal performance yet significantly disagree on individual-level outcomes. This phenomenon, known as predictive multiplicity, has been formally defined in binary, probabilistic, and multi-target classification, and undermines the reliability of predictive systems. However, its implications remain unexplored in the context of survival analysis, which involves estimating the time until a failure or similar event while properly handling censored data. We frame predictive multiplicity as a critical concern in survival-based models and introduce formal measures -- ambiguity, discrepancy, and obscurity -- to quantify it. This is particularly relevant for downstream tasks such as maintenance scheduling, where precise individual risk estimates are essential. Understanding and reporting predictive multiplicity helps build trust in models deployed in high-stakes environments. We apply our methodology to benchmark datasets from predictive maintenance, extending the notion of multiplicity to survival models. Our findings show that ambiguity steadily increases, reaching up to 40-45% of observations; discrepancy is lower but exhibits a similar trend; and obscurity remains mild and concentrated in a few models. These results demonstrate that multiple accurate survival models may yield conflicting estimations of failure risk and degradation progression for the same equipment. This highlights the need to explicitly measure and communicate predictive multiplicity to ensure reliable decision-making in process health management.


An Incremental Non-Linear Manifold Approximation Method

arXiv.org Machine Learning

Analyzing high-dimensional data presents challenges due to the "curse of dimensionality'', making computations intensive. Dimension reduction techniques, categorized as linear or non-linear, simplify such data. Non-linear methods are particularly essential for efficiently visualizing and processing complex data structures in interactive and graphical applications. This research develops an incremental non-linear dimension reduction method using the Geometric Multi-Resolution Analysis (GMRA) framework for streaming data. The proposed method enables real-time data analysis and visualization by incrementally updating the cluster map, PCA basis vectors, and wavelet coefficients. Numerical experiments show that the incremental GMRA accurately represents non-linear manifolds even with small initial samples and aligns closely with batch GMRA, demonstrating efficient updates and maintaining the multiscale structure. The findings highlight the potential of Incremental GMRA for real-time visualization and interactive graphics applications that require adaptive high-dimensional data representations.


Riemannian Optimization on Relaxed Indicator Matrix Manifold

arXiv.org Machine Learning

The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix and prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide several methods of Retraction, including a fast Retraction method to obtain geodesics. We point out that the RIM manifold is a generalization of the double stochastic manifold, and it is much faster than existing methods on the double stochastic manifold, which has a complexity of \( \mathcal{O}(n^3) \), while RIM manifold optimization is \( \mathcal{O}(n) \) and often yields better results. We conducted extensive experiments, including image denoising, with millions of variables to support our conclusion, and applied the RIM manifold to Ratio Cut, we provide a rigorous convergence proof and achieve clustering results that outperform the state-of-the-art methods. Our Code in \href{https://github.com/Yuan-Jinghui/Riemannian-Optimization-on-Relaxed-Indicator-Matrix-Manifold}{here}.


Your data's probably not ready for AI - here's how to make it trustworthy

ZDNet

Trust is fragile, and that's one problem with artificial intelligence, which is only as good as the data behind it. Data integrity concerns -- which have vexed even the savviest organizations for decades -- is rearing its head again. And industry experts are sounding the alarm. Users of generative AI may be fed incomplete, duplicative, or erroneous information that comes back to bite them -- thanks to the weak or siloed data underpinning these systems. "AI and gen AI are raising the bar for quality data," according to a recent analysis published by Ashish Verma, chief data and analytics officer at Deloitte US, and a team of co-authors.


Local Distance-Preserving Node Embeddings and Their Performance on Random Graphs

arXiv.org Machine Learning

Learning node representations is a fundamental problem in graph machine learning. While existing embedding methods effectively preserve local similarity measures, they often fail to capture global functions like graph distances. Inspired by Bourgain's seminal work on Hilbert space embeddings of metric spaces (1985), we study the performance of local distance-preserving node embeddings. Known as landmark-based algorithms, these embeddings approximate pairwise distances by computing shortest paths from a small subset of reference nodes (i.e., landmarks). Our main theoretical contribution shows that random graphs, such as Erd\H{o}s-R\'enyi random graphs, require lower dimensions in landmark-based embeddings compared to worst-case graphs. Empirically, we demonstrate that the GNN-based approximations for the distances to landmarks generalize well to larger networks, offering a scalable alternative for graph representation learning.


Performance of Rank-One Tensor Approximation on Incomplete Data

arXiv.org Machine Learning

We are interested in the estimation of a rank-one tensor signal when only a portion $\varepsilon$ of its noisy observation is available. We show that the study of this problem can be reduced to that of a random matrix model whose spectral analysis gives access to the reconstruction performance. These results shed light on and specify the loss of performance induced by an artificial reduction of the memory cost of a tensor via the deletion of a random part of its entries.


Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

arXiv.org Artificial Intelligence

The Model Context Protocol (MCP) is a standardized interface designed to enable seamless interaction between AI models and external tools and resources, breaking down data silos and facilitating interoperability across diverse systems. This paper provides a comprehensive overview of MCP, focusing on its core components, workflow, and the lifecycle of MCP servers, which consists of three key phases: creation, operation, and update. We analyze the security and privacy risks associated with each phase and propose strategies to mitigate potential threats. The paper also examines the current MCP landscape, including its adoption by industry leaders and various use cases, as well as the tools and platforms supporting its integration. We explore future directions for MCP, highlighting the challenges and opportunities that will influence its adoption and evolution within the broader AI ecosystem. Finally, we offer recommendations for MCP stakeholders to ensure its secure and sustainable development as the AI landscape continues to evolve.


Cramer-Rao Bounds for Laplacian Matrix Estimation

arXiv.org Machine Learning

Abstract--In this paper, we analyze the performance of the estimation of Laplacian matrices under general observatio n models. Laplacian matrix estimation involves structural c on-straints, including symmetry and null-space properties, a long with matrix sparsity. By exploiting a linear reparametriza tion that enforces the structural constraints, we derive closed -form matrix expressions for the Cram er-Rao Bound (CRB) specifically tailored to Laplacian matrix estimation. We further extend the derivation to the sparsity-constrained case, introduc ing two oracle CRBs that incorporate prior information of the suppo rt set, i.e. the locations of the nonzero entries in the Laplaci an matrix. We examine the properties and order relations betwe en the bounds, and provide the associated Slepian-Bangs formu la for the Gaussian case. We demonstrate the use of the new CRBs in three representative applications: (i) topology identi fication in power systems, (ii) graph filter identification in diffuse d models, and (iii) precision matrix estimation in Gaussian M arkov random fields under Laplacian constraints. The CRBs are eval - uated and compared with the mean-squared-errors (MSEs) of the constrained maximum likelihood estimator (CMLE), whic h integrates both equality and inequality constraints along with sparsity constraints, and of the oracle CMLE, which knows the locations of the nonzero entries of the Laplacian matrix . We perform this analysis for the applications of power syste m topology identification and graphical LASSO, and demonstra te that the MSEs of the estimators converge to the CRB and oracle CRB, given a sufficient number of measurements. Graph-structured data and signals arise in numerous applications, including power systems, communications, finance, social networks, and biological networks, for analysis and inference of networks [ 2 ], [ 3 ]. In this context, the Laplacian matrix, which captures node connectivity and edge weights, serves as a fundamental tool for clustering [ 4 ], modeling graph diffusion processes [ 5 ], [ 6 ], topology inference [ 6 ]-[ 12 ], anomaly detection [ 13 ], graph-based filtering [ 14 ]-[ 18 ], and analyzing smoothness on graphs [ 19 ]. M. Halihal and T. Routtenberg are with the School of Electric al and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel, e-mail: moradha@post.bgu.ac.il, tirzar@b gu.ac.il.