Phillips, Steven J.
Generative and Discriminative Learning with Unknown Labeling Bias
Phillips, Steven J., Dudík, Miroslav
We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropy-based weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of label bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data.
Correcting sample selection bias in maximum entropy density estimation
Dudík, Miroslav, Phillips, Steven J., Schapire, Robert E.
We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using samples from the sampling distribution. We provide guarantees for the first two approaches and evaluate the performance of all three approaches in synthetic experiments and on real data from species habitat modeling, where maxent has been successfully applied and where sample selection bias is a significant problem.
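The second approach (estimate the biased distribution, then factor the bias out) can be illustrated on a finite domain. The sketch below is not the authors' implementation. It assumes the biased sampling distribution is proportional to the true density times a known, strictly positive sampling bias, and it takes the biased estimate as given rather than fitting it with maxent. The function name `factor_out_bias` and the toy numbers are hypothetical.

```python
def factor_out_bias(biased_density, sampling_bias):
    """Recover an unbiased density from a biased one on a finite domain.

    Assumption (for illustration only): biased_density[i] is proportional
    to true_density[i] * sampling_bias[i], and sampling_bias is known.
    """
    if len(biased_density) != len(sampling_bias):
        raise ValueError("inputs must have the same length")
    if any(s <= 0 for s in sampling_bias):
        raise ValueError("sampling bias must be strictly positive")
    # Divide the bias out pointwise, then renormalize to sum to 1.
    unnormalized = [q / s for q, s in zip(biased_density, sampling_bias)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Toy check: bias a known density, then recover it.
true_density = [0.1, 0.2, 0.3, 0.4]
sampling_bias = [4.0, 2.0, 1.0, 1.0]  # hypothetical relative sampling effort
products = [p * s for p, s in zip(true_density, sampling_bias)]
biased_density = [v / sum(products) for v in products]
recovered = factor_out_bias(biased_density, sampling_bias)
```

In this toy setting the recovered density matches the true one up to floating-point error, because the biased density is exact. With a biased density estimated from samples, the recovered density inherits the estimation error.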