data bias
Information-Geometric Decomposition of Generalization Error in Unsupervised Learning
We decompose the Kullback--Leibler generalization error (GE) -- the expected KL divergence from the data distribution to the trained model -- of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to $ε$-PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank $N_K$ and discarded directions are pinned at a fixed noise floor $ε$. Although rank-constrained $ε$-PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff $λ_{\mathrm{cut}}^{*} = ε$ -- the model retains exactly those empirical eigenvalues exceeding the noise floor -- with the cutoff reflecting a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram -- retain-all, interior, and collapse -- separated by the lower Marchenko--Pastur edge and an analytically computable collapse threshold $ε_{*}(α)$, where $α$ is the dimension-to-sample-size ratio. All claims are verified numerically.
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- Europe > Russia (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.70)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Portugal > Porto > Porto (0.04)
- North America > United States > California (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.69)
- North America > United States > Ohio (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Adaptive Data Debiasing through Bounded Exploration
Biases in existing datasets used to train algorithmic decision rules can raise ethical and economic concerns due to the resulting disparate treatment of different groups. We propose an algorithm for sequentially debiasing such datasets through adaptive and bounded exploration in a classification problem with costly and censored feedback. Exploration in this context means that at times, and to a judiciously-chosen extent, the decision maker deviates from its (current) loss-minimizing rule, and instead accepts some individuals that would otherwise be rejected, so as to reduce statistical data biases. Our proposed algorithm includes parameters that can be used to balance between the ultimate goal of removing data biases -- which will in turn lead to more accurate and fair decisions, and the exploration risks incurred to achieve this goal. We analytically show that such exploration can help debias data in certain distributions. We further investigate how fairness criteria can work in conjunction with our data debiasing algorithm. We illustrate the performance of our algorithm using experiments on synthetic and real-world datasets.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Portugal > Porto > Porto (0.04)
- North America > United States > Ohio (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
MuPlon: Multi-Path Causal Optimization for Claim Verification through Controlling Confounding
Guo, Hanghui, Di, Shimin, De Meo, Pasquale, Chen, Zhangze, Zhu, Jia
Abstract--As a critical task in data quality control, claim verification aims to curb the spread of misinformation by assessing the truthfulness of claims based on a wide range of evidence. However, traditional methods often overlook the complex interactions between evidence, leading to unreliable verification results. A straightforward solution represents the claim and evidence as a fully connected graph, which we define as the Claim-Evidence Graph (C-E Graph). Nevertheless, claim verification methods based on fully connected graphs face two primary confounding challenges, Data Noise and Data Biases. T o address these challenges, we propose a novel framework, Multi-Path Causal Optimization (MuPlon). In the front-door path, MuPlon extracts highly relevant subgraphs and constructs reasoning paths, further applying counterfactual reasoning to eliminate data biases within these paths. The experimental results demonstrate that MuPlon outperforms existing methods and achieves state-of-the-art performance.
- Research Report > Strength High (0.46)
- Research Report > Experimental Study (0.46)
- North America > United States > California (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.69)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- (9 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.98)
- Information Technology > Data Science > Data Mining (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)