Loukas, Orestis
Categorical Distributions of Maximum Entropy under Marginal Constraints
Loukas, Orestis, Chung, Ho Ryun
The estimation of categorical distributions under marginal constraints summarizing some sample from a population in the most-generalizable way is key for many machine-learning and data-driven approaches. We provide a parameter-agnostic theoretical framework that enables this task ensuring (i) that a categorical distribution of Maximum Entropy under marginal constraints always exists and (ii) that it is unique. The procedure of iterative proportional fitting (IPF) naturally estimates that distribution from any consistent set of marginal constraints directly in the space of probabilities, thus deductively identifying a least-biased characterization of the population. The theoretical framework together with IPF leads to a holistic workflow that enables modeling any class of categorical distributions solely using the phenomenological information provided.
Demographic Parity: Mitigating Biases in Real-World Data
Loukas, Orestis, Chung, Ho-Ryun
Computer-based decision systems are widely used to automate decisions in many aspects of everyday life, which include sensitive areas like hiring, loaning and even criminal sentencing. A decision pipeline heavily relies on large volumes of historical real-world data for training its models. However, historical training data often contains gender, racial or other biases which are propagated to the trained models influencing computer-based decisions. In this work, we propose a robust methodology that guarantees the removal of unwanted biases while maximally preserving classification utility. Our approach can always achieve this in a model-independent way by deriving from real-world data the asymptotic dataset that uniquely encodes demographic parity and realism. As a proof-of-principle, we deduce from public census records such an asymptotic dataset from which synthetic samples can be generated to train well-established classifiers. Benchmarking the generalization capability of these classifiers trained on our synthetic data, we confirm the absence of any explicit or implicit bias in the computer-aided decision.
Entropy-based Characterization of Modeling Constraints
Loukas, Orestis, Chung, Ho Ryun
In most data-scientific approaches, the principle of Maximum Entropy (MaxEnt) is used to a posteriori justify some parametric model which has been already chosen based on experience, prior knowledge or computational simplicity. In a perpendicular formulation to conventional model building, we start from the linear system of phenomenological constraints and asymptotically derive the distribution over all viable distributions that satisfy the provided set of constraints. The MaxEnt distribution plays a special role, as it is the most typical among all phenomenologically viable distributions representing a good expansion point for large-N techniques. This enables us to consistently formulate hypothesis testing in a fully-data driven manner. The appropriate parametric model which is supported by the data can be always deduced at the end of model selection. In the MaxEnt framework, we recover major scores and selection procedures used in multiple applications and assess their ability to capture associations in the data-generating process and identify the most generalizable model. This data-driven counterpart of standard model selection demonstrates the unifying prospective of the deductive logic advocated by MaxEnt principle, while potentially shedding new insights to the inverse problem.