disaggregated evaluation
Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness
Pfohl, Stephen R., Harris, Natalie, Nagpal, Chirag, Madras, David, Mhasawade, Vishwali, Salaudeen, Olawale, Dieng, Awa, Sequeira, Shannon, Arciniegas, Santiago, Sung, Lillian, Ezeanochie, Nnamdi, Cole-Lewis, Heather, Heller, Katherine, Koyejo, Sanmi, D'Amour, Alexander
Disaggregated evaluation across subgroups is critical for assessing the fairness of machine learning models, but its uncritical use can mislead practitioners. We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of the relevant populations but reflective of real-world disparities. Furthermore, when data are not representative due to selection bias, both disaggregated evaluation and alternative approaches based on conditional independence testing may be invalid without explicit assumptions regarding the bias mechanism. We use causal graphical models to predict metric stability across subgroups under different data generating processes. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift, including conditional independence testing and weighted performance estimation. These findings have broad implications for how practitioners design and interpret model assessments given the ubiquity of disaggregated evaluation.
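The abstract above mentions weighted performance estimation as a way to correct for selection bias. As a rough, hedged sketch (not the authors' implementation; the function name and inputs are hypothetical, and the selection probabilities are assumed known), inverse-probability weighting reweights each evaluated example by 1/p(selected) before computing a per-group metric:

```python
import numpy as np

def ipw_accuracy(correct, group, weights):
    """Inverse-probability-weighted accuracy per subgroup.

    correct: 0/1 indicator of whether the model was right on each example.
    group:   subgroup label of each example.
    weights: 1 / p(example selected into the evaluation set), assumed known.
    """
    correct = np.asarray(correct, float)
    weights = np.asarray(weights, float)
    group = np.asarray(group)
    out = {}
    for g in np.unique(group):
        mask = group == g
        # Weighted mean: reweights the evaluation sample toward the
        # target population within each subgroup.
        out[g] = float(np.sum(weights[mask] * correct[mask]) / np.sum(weights[mask]))
    return out
```

The unweighted per-group mean is the special case where all weights equal one; the gap between the two estimates is one diagnostic for how much the assumed selection mechanism matters.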
SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation
Khodak, Mikhail, Mackey, Lester, Chouldechova, Alexandra, Dudík, Miroslav
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems. A key challenge is that evaluation data is scarce, and subpopulations arising from intersections of attributes (e.g., race, sex, age) are often tiny. Today, it is common for multiple clients to procure the same AI model from a model developer, and the task of disaggregated evaluation is faced by each customer individually. This gives rise to what we call the multi-task disaggregated evaluation problem, wherein multiple clients seek to conduct a disaggregated evaluation of a given model in their own data setting (task). In this work we develop a disaggregated evaluation method called SureMap that has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models. SureMap's efficiency gains come from (1) transforming the problem into structured simultaneous Gaussian mean estimation and (2) incorporating external data, e.g., from the AI system creator or from their other clients. Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE). We evaluate SureMap on disaggregated evaluation tasks in multiple domains, observing significant accuracy improvements over several strong competitors.
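To make the abstract's recipe concrete, here is a minimal, hedged sketch of the core idea — MAP shrinkage of noisy per-subgroup mean estimates toward an external reference, with the prior strength tuned by Stein's unbiased risk estimate (SURE). This is an illustration of the general technique, not SureMap itself; the function name, the scalar reference `mu`, and the grid of candidate prior variances are all assumptions for the example:

```python
import numpy as np

def sure_tuned_shrinkage(y, var, mu, tau2_grid):
    """Shrink per-group mean estimates toward an external reference.

    y:         naive per-subgroup estimates, y_g ~ N(theta_g, var_g).
    var:       known sampling variances var_g (e.g., s^2 / n_g).
    mu:        external reference value, e.g., from the model developer.
    tau2_grid: candidate prior variances for theta_g ~ N(mu, tau2).
    """
    y, var = np.asarray(y, float), np.asarray(var, float)
    best_risk, best_est = np.inf, y
    for tau2 in tau2_grid:
        # MAP estimate under the Gaussian prior: a convex combination
        # of the raw estimate and the reference, weighted by precision.
        w = tau2 / (tau2 + var)
        est = w * y + (1.0 - w) * mu
        # Stein's unbiased risk estimate for this linear shrinker:
        # ||est - y||^2 + 2 * sum(var * d est/d y) - sum(var).
        sure = np.sum((est - y) ** 2) + 2.0 * np.sum(var * w) - np.sum(var)
        if sure < best_risk:
            best_risk, best_est = sure, est
    return best_est
```

Small subgroups (large `var_g`) get pulled strongly toward `mu`, while well-sampled subgroups keep estimates close to their raw means — and SURE replaces cross-validation for picking the shrinkage strength.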
A structured regression approach for evaluating model performance across intersectional subgroups
Herlihy, Christine, Truong, Kimberly, Chouldechova, Alexandra, Dudík, Miroslav
Disaggregated evaluation is a central task in AI fairness assessment, with the goal to measure an AI system's performance across different subgroups defined by combinations of demographic or other sensitive attributes. The standard approach is to stratify the evaluation data across subgroups and compute performance metrics separately for each group. However, even for moderately-sized evaluation datasets, sample sizes quickly get small once considering intersectional subgroups, which greatly limits the extent to which intersectional groups are considered in many disaggregated evaluations. In this work, we introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups. We also provide corresponding inference strategies for constructing confidence intervals and explore how goodness-of-fit testing can yield insight into the structure of fairness-related harms experienced by intersectional groups. We evaluate our approach on two publicly available datasets, and several variants of semi-synthetic data. The results show that our method is considerably more accurate than the standard approach, especially for small subgroups, and goodness-of-fit testing helps identify the key factors that drive differences in performance.
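The contrast the abstract draws — per-cell stratification versus a structured regression — can be sketched in a few lines. This is a hedged toy version under a strong assumption (a purely additive main-effects linear model over two attributes, fit by least squares), not the paper's method, and the helper names are hypothetical:

```python
import numpy as np

def one_hot(vals, levels):
    # Drop-first dummy coding to keep the design matrix full rank.
    return np.array([[v == l for l in levels[1:]] for v in vals], float)

def main_effects_estimates(correct, attr_a, attr_b):
    """Estimate per-cell accuracy via an additive main-effects regression.

    Unlike stratification, this borrows strength across cells, so it can
    produce an estimate even for intersections with few (or zero) examples.
    """
    correct = np.asarray(correct, float)
    a_levels, b_levels = sorted(set(attr_a)), sorted(set(attr_b))
    X = np.hstack([np.ones((len(correct), 1)),
                   one_hot(attr_a, a_levels),
                   one_hot(attr_b, b_levels)])
    beta, *_ = np.linalg.lstsq(X, correct, rcond=None)
    # Read off a fitted accuracy for every intersectional cell.
    cells = [(a, b) for a in a_levels for b in b_levels]
    Xc = np.hstack([np.ones((len(cells), 1)),
                    one_hot([c[0] for c in cells], a_levels),
                    one_hot([c[1] for c in cells], b_levels)])
    return dict(zip(cells, Xc @ beta))
```

Comparing these fitted values against the raw stratified cell means is one informal version of the goodness-of-fit check the abstract describes: large gaps suggest interaction effects that a main-effects structure cannot capture.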
Assessing the Fairness of AI Systems: AI Practitioners' Processes, Challenges, and Needs for Support
Madaio, Michael, Egede, Lisa, Subramonyam, Hariharan, Vaughan, Jennifer Wortman, Wallach, Hanna
Various tools and practices have been developed to support practitioners in identifying, assessing, and mitigating fairness-related harms caused by AI systems. However, prior research has highlighted gaps between the intended design of these tools and practices and their use within particular contexts, including gaps caused by the role that organizational factors play in shaping fairness work. In this paper, we investigate these gaps for one such practice: disaggregated evaluations of AI systems, intended to uncover performance disparities between demographic groups. By conducting semi-structured interviews and structured workshops with thirty-three AI practitioners from ten teams at three technology companies, we identify practitioners' processes, challenges, and needs for support when designing disaggregated evaluations. We find that practitioners face challenges when choosing performance metrics, identifying the most relevant direct stakeholders and demographic groups on which to focus, and collecting datasets with which to conduct disaggregated evaluations. More generally, we identify impacts on fairness work stemming from a lack of engagement with direct stakeholders, business imperatives that prioritize customers over marginalized groups, and the drive to deploy AI systems at scale.
Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
Barocas, Solon, Guo, Anhong, Kamar, Ece, Krones, Jacquelyn, Morris, Meredith Ringel, Vaughan, Jennifer Wortman, Wadsworth, Duncan, Wallach, Hanna
Several pieces of work have uncovered performance disparities by conducting "disaggregated evaluations" of AI systems. We build on these efforts by focusing on the choices that must be made when designing a disaggregated evaluation, as well as some of the key considerations that underlie these design choices and the tradeoffs between these considerations. We argue that a deeper understanding of the choices, considerations, and tradeoffs involved in designing disaggregated evaluations will better enable researchers, practitioners, and the public to understand the ways in which AI systems may be underperforming for particular groups of people.