linear regression model
- North America > United States (0.29)
- Asia > Indonesia > Bali (0.04)
- North America > Canada (0.04)
- Africa > South Africa (0.04)
- Government (0.69)
- Law (0.46)
Hunting for Discriminatory Proxies in Linear Regression Models
Samuel Yeom, Anupam Datta, Matt Fredrikson
A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause for such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models, and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and further extend this result to account for situations where the use of a certain input variable is justified as a "business necessity". Finally, we present empirical results on two law enforcement datasets that exhibit varying degrees of racial disparity in prediction outcomes, demonstrating that proxies shed useful light on the causes of discriminatory behavior in models.
- North America > United States > Illinois > Cook County > Chicago (0.05)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Connecticut (0.04)
- (3 more...)
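The two conditions in the abstract above (close correlation with a protected variable, and influence on the model's overall behavior) can be illustrated with a small numerical sketch. The abstract gives no formulas, so the measures used here are assumptions chosen for illustration: association is taken as the squared Pearson correlation, and influence as the share of the prediction variance carried by the component. The names `association`, `influence`, and `component_output`, the weights `alpha`, and the synthetic data are all hypothetical. The sketch only evaluates one candidate component; it does not solve the second-order cone program the authors describe.

```python
import numpy as np

def association(p, z):
    """Squared Pearson correlation between a component output p and a protected attribute z (assumed measure)."""
    cov = np.cov(p, z)  # 2x2 sample covariance matrix
    return cov[0, 1] ** 2 / (cov[0, 0] * cov[1, 1])

def influence(p, y_hat):
    """Share of the prediction variance carried by the component (assumed measure)."""
    return np.var(p, ddof=1) / np.var(y_hat, ddof=1)

def component_output(X, beta, alpha):
    """Output of a candidate component: each coefficient beta_i scaled by a weight alpha_i in [0, 1]."""
    return X @ (alpha * beta)

# Synthetic data: x0 is tightly tied to the protected attribute z, x1 is independent of it.
rng = np.random.default_rng(0)
n = 5000
z = rng.integers(0, 2, size=n).astype(float)
x0 = z + 0.1 * rng.normal(size=n)
x1 = rng.normal(size=n)
X = np.column_stack([x0, x1])

beta = np.array([2.0, 1.0])    # coefficients of a hypothetical fitted linear model
y_hat = X @ beta               # model predictions

alpha = np.array([1.0, 0.0])   # candidate component: the x0 term only
p = component_output(X, beta, alpha)
assoc = association(p, z)
infl = influence(p, y_hat)
print(f"association={assoc:.3f}, influence={infl:.3f}")
```

In this toy setup the `x0` term scores high on both measures, which is the pattern the abstract describes as proxy use. A real audit would search over component weights rather than test a single hand-picked `alpha`.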
A Additional Discussions
In this work, we focus on the optimal membership inference adversary. The following lemmas are used in the proofs of Theorems 3.2 and D.1. We start with the first term. In the last line, we use the same argument as in Section 2.2 of [...]. Adding this with the result for the first term gives the desired result. Thus, the regressor memorizes the training data and the training error is equal to zero.
EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions
In a mixture of linear regressions model, the regression coefficients are treated as random vectors that may follow either a continuous or discrete distribution. We propose two Expectation-Maximization (EM) algorithms to estimate this prior distribution. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE). This method not only recovers continuous prior distributions but also accurately estimates the number of clusters when the prior is discrete. The second algorithm, designed to approximate the NPMLE, targets prior distributions with a density. It also performs well for discrete priors when combined with a post-processing step. We study the convergence properties of both algorithms and demonstrate their effectiveness through simulations and applications to real datasets.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Netherlands (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.87)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.87)
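For orientation, the simplest special case of the setting in the abstract above is a discrete prior with a known number of components K. The sketch below is the textbook EM iteration for that finite mixture, not either of the two nonparametric algorithms the authors propose. The function name `em_mixture_linreg`, the assumption of a known noise level `sigma`, the random initialization, and the synthetic data are all illustrative choices.

```python
import numpy as np

def em_mixture_linreg(X, y, K, sigma=0.5, n_iter=200, seed=0):
    """Textbook EM for a K-component mixture of linear regressions with known noise level sigma."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    betas = rng.normal(size=(K, d))      # random initial coefficient vectors
    weights = np.full(K, 1.0 / K)        # initial mixing proportions
    for _ in range(n_iter):
        # E-step: posterior probability that each sample came from each component.
        resid = y[:, None] - X @ betas.T                  # shape (n, K)
        log_r = np.log(weights) - 0.5 * (resid / sigma) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)         # stabilize before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component, then update the proportions.
        for k in range(K):
            w = r[:, k]
            A = X.T @ (w[:, None] * X) + 1e-8 * np.eye(d)  # tiny ridge keeps the solve well-posed
            betas[k] = np.linalg.solve(A, X.T @ (w * y))
        weights = np.clip(r.mean(axis=0), 1e-12, None)     # avoid log(0) in the next E-step
    return betas, weights

# Synthetic data drawn from two well-separated regression lines.
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 2))
true_betas = np.array([[3.0, -2.0], [-3.0, 2.0]])
labels = rng.integers(0, 2, size=n)
y = np.einsum("ij,ij->i", X, true_betas[labels]) + 0.5 * rng.normal(size=n)

betas, weights = em_mixture_linreg(X, y, K=2, sigma=0.5)
print(betas, weights)
```

The recovered components come back in arbitrary order, so any comparison with ground truth needs to match them up first. EM of this kind can also stall in local optima when the components are poorly separated.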
Supplementary material 1 Dataset documentation
In this section, we follow the Datasheets for Datasets framework (Gebru et al., 2020) to document the [...]
Who created the dataset (e.g., which team, research group) and on behalf of which [...]
Who funded the creation of the dataset? This work is funded by Digital Futures in the project EO-AI4GlobalChange.
What do the instances that comprise the dataset represent (e.g., documents, photos, [...] Each instance is one image consisting of 23 channels.
How many instances are there in total (of each type, if appropriate)? There are 13 607 images in total.
- North America > United States (0.29)
- Asia > Indonesia > Bali (0.04)
- North America > Canada (0.04)
- Africa > South Africa (0.04)
- Government (0.69)
- Law (0.46)
On Sparse Gaussian Chain Graph Models
In this paper, we address the problem of learning the structure of Gaussian chain graph models in a high-dimensional space. Chain graph models are generalizations of undirected and directed graphical models that contain a mixed set of directed and undirected edges. While the problem of sparse structure learning has been studied extensively for Gaussian graphical models and more recently for conditional Gaussian graphical models (CGGMs), there has been little previous work on the structure recovery of Gaussian chain graph models. We consider linear regression models and a re-parameterization of the linear regression models using CGGMs as building blocks of chain graph models. We argue that when the goal is to recover model structures, there are many advantages of using CGGMs as chain component models over linear regression models, including convexity of the optimization problem, computational efficiency, recovery of structured sparsity, and ability to leverage the model structure for semi-supervised learning. We demonstrate our approach on simulated and genomic datasets.
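The re-parameterization mentioned in the abstract above can be made concrete with a small numerical check. The sketch assumes the standard CGGM form, in which the conditional density of y given x is proportional to exp(-½ yᵀΘ_yy y - xᵀΘ_xy y). The abstract does not spell out this form, so treat it as an assumption. Under it, the equivalent linear regression has coefficient matrix B = -Θ_xy Θ_yy⁻¹ and noise covariance Θ_yy⁻¹. The function name `cggm_to_regression` and the random test matrices are hypothetical.

```python
import numpy as np

def cggm_to_regression(theta_xy, theta_yy):
    """Map CGGM parameters to regression coefficients B (p x q) and noise covariance (q x q)."""
    noise_cov = np.linalg.inv(theta_yy)
    B = -theta_xy @ noise_cov
    return B, noise_cov

rng = np.random.default_rng(1)
p, q = 3, 2                                # number of inputs and outputs
theta_xy = rng.normal(size=(p, q))
M = rng.normal(size=(q, q))
theta_yy = M @ M.T + q * np.eye(q)         # symmetric positive definite by construction

B, noise_cov = cggm_to_regression(theta_xy, theta_yy)

# The regression mean B^T x should be the mode of the CGGM conditional density,
# so the gradient of the log-density with respect to y must vanish there.
x = rng.normal(size=p)
mean_y = B.T @ x
grad = -theta_yy @ mean_y - theta_xy.T @ x
print(np.abs(grad).max())
```

Note that a sparse Θ_xy does not in general produce a sparse B, because multiplying by Θ_yy⁻¹ mixes the columns. This is one reason the two parameterizations can recover different structures.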
Linear Regression under Missing or Corrupted Coordinates
Ilias Diakonikolas, Jelena Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas
We study multivariate linear regression under Gaussian covariates in two settings, where data may be erased or corrupted by an adversary under a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $η$-fraction of samples per coordinate, a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even information-theoretically. Unlike the clean setting, where estimation error vanishes with more samples, here the optimal error remains a positive function of the problem parameters. Our main contribution is to characterize this error up to constant factors across essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error that match the error of (computationally efficient) algorithms. A key implication is that, perhaps surprisingly, the optimal error in the missing data setting matches that in the corruption setting, so knowing the corruption locations offers no general advantage.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (8 more...)
- Government > Military (0.45)
- Education > Educational Setting > Online (0.45)
- Government > Regional Government (0.45)