Goto

Collaborating Authors

 data bias


Information-Geometric Decomposition of Generalization Error in Unsupervised Learning

Kim, Gilhan

arXiv.org Machine Learning

We decompose the Kullback--Leibler generalization error (GE) -- the expected KL divergence from the data distribution to the trained model -- of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to $ε$-PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank $N_K$ and discarded directions are pinned at a fixed noise floor $ε$. Although rank-constrained $ε$-PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff $λ_{\mathrm{cut}}^{*} = ε$ -- the model retains exactly those empirical eigenvalues exceeding the noise floor -- with the cutoff reflecting a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram -- retain-all, interior, and collapse -- separated by the lower Marchenko--Pastur edge and an analytically computable collapse threshold $ε_{*}(α)$, where $α$ is the dimension-to-sample-size ratio. All claims are verified numerically.





Adaptive Data Debiasing through Bounded Exploration

Neural Information Processing Systems

Biases in existing datasets used to train algorithmic decision rules can raise ethical and economic concerns due to the resulting disparate treatment of different groups. We propose an algorithm for sequentially debiasing such datasets through adaptive and bounded exploration in a classification problem with costly and censored feedback. Exploration in this context means that at times, and to a judiciously-chosen extent, the decision maker deviates from its (current) loss-minimizing rule, and instead accepts some individuals that would otherwise be rejected, so as to reduce statistical data biases. Our proposed algorithm includes parameters that can be used to balance between the ultimate goal of removing data biases -- which will in turn lead to more accurate and fair decisions, and the exploration risks incurred to achieve this goal. We analytically show that such exploration can help debias data in certain distributions. We further investigate how fairness criteria can work in conjunction with our data debiasing algorithm. We illustrate the performance of our algorithm using experiments on synthetic and real-world datasets.




MuPlon: Multi-Path Causal Optimization for Claim Verification through Controlling Confounding

Guo, Hanghui, Di, Shimin, De Meo, Pasquale, Chen, Zhangze, Zhu, Jia

arXiv.org Artificial Intelligence

Abstract--As a critical task in data quality control, claim verification aims to curb the spread of misinformation by assessing the truthfulness of claims based on a wide range of evidence. However, traditional methods often overlook the complex interactions between evidence, leading to unreliable verification results. A straightforward solution represents the claim and evidence as a fully connected graph, which we define as the Claim-Evidence Graph (C-E Graph). Nevertheless, claim verification methods based on fully connected graphs face two primary confounding challenges, Data Noise and Data Biases. T o address these challenges, we propose a novel framework, Multi-Path Causal Optimization (MuPlon). In the front-door path, MuPlon extracts highly relevant subgraphs and constructs reasoning paths, further applying counterfactual reasoning to eliminate data biases within these paths. The experimental results demonstrate that MuPlon outperforms existing methods and achieves state-of-the-art performance.