fairface
Utility-Fairness Trade-Offs and How to Find Them
Dehdashtian, Sepehr, Sadeghi, Bashir, Boddeti, Vishnu Naresh
When building classification systems with demographic fairness considerations, there are two objectives to satisfy: 1) maximizing utility for the specific task and 2) ensuring fairness w.r.t. a known demographic attribute. These objectives often compete, so optimizing both leads to a trade-off between utility and fairness. While existing works acknowledge these trade-offs and study their limits, two questions remain unanswered: 1) What are the optimal trade-offs between utility and fairness? and 2) How can we numerically quantify these trade-offs from data for a desired prediction task and demographic attribute of interest? This paper addresses these questions. We introduce two utility-fairness trade-offs: the Data-Space and Label-Space Trade-Offs. These trade-offs reveal three regions within the utility-fairness plane, delineating what is fully possible, what is partially possible, and what is impossible. We propose U-FaTE, a method to numerically quantify the trade-offs for a given prediction task and group fairness definition from data samples. Based on the trade-offs, we introduce a new scheme for evaluating representations. An extensive evaluation of fair representation learning methods and of representations from over 1000 pre-trained models reveals that most current approaches are far from the estimated and achievable utility-fairness trade-offs across multiple datasets and prediction tasks.
- North America > United States > Washington (0.04)
- North America > United States > Michigan (0.04)
- Europe > France (0.04)
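The abstract above does not spell out U-FaTE's estimator, but the idea of tracing a utility-fairness frontier from data can be illustrated with a generic penalty sweep: train the same classifier under increasing weights on a demographic-parity penalty and record the resulting (unfairness, utility) pairs. Everything below (the linear model, the synthetic data, the choice of demographic parity as the fairness definition) is an illustrative assumption, not the paper's method:

```python
# Generic lambda-sweep to trace an approximate utility-fairness frontier.
# NOT U-FaTE itself; a simple penalty-based baseline for illustration.
import torch

def demographic_parity_gap(p, a):
    # absolute difference in mean predicted score between the two groups
    return (p[a == 0].mean() - p[a == 1].mean()).abs()

def trace_frontier(X, y, a, lambdas, epochs=200):
    frontier = []
    for lam in lambdas:
        w = torch.zeros(X.shape[1], requires_grad=True)
        b = torch.zeros(1, requires_grad=True)
        opt = torch.optim.Adam([w, b], lr=0.05)
        for _ in range(epochs):
            p = torch.sigmoid(X @ w + b)
            loss = torch.nn.functional.binary_cross_entropy(p, y) \
                   + lam * demographic_parity_gap(p, a)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            p = torch.sigmoid(X @ w + b)
            acc = ((p > 0.5).float() == y).float().mean().item()  # utility
            gap = demographic_parity_gap(p, a).item()             # unfairness
        frontier.append((gap, acc))  # one point on the trade-off curve
    return frontier

# synthetic usage: features, a binary task label, and a binary group label
X = torch.randn(1000, 5)
y = (X[:, 0] > 0).float()
a = torch.randint(0, 2, (1000,))
print(trace_frontier(X, y, a, lambdas=[0.0, 0.5, 2.0, 10.0]))
```

Each lambda yields one point; the upper-left envelope of the collected points approximates an achievable trade-off curve.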
AI-generated faces free from racial and gender stereotypes
AlDahoul, Nouar, Rahwan, Talal, Zaki, Yasir
Text-to-image generative AI models such as Stable Diffusion are used daily by millions worldwide. However, many have raised concerns regarding how these models amplify racial and gender stereotypes. To study this phenomenon, we develop a classifier to predict the race, gender, and age group of any given face image, and show that it achieves state-of-the-art performance. Using this classifier, we quantify biases in Stable Diffusion across six races, two genders, five age groups, 32 professions, and eight attributes. We then propose novel debiasing solutions that outperform state-of-the-art alternatives. Additionally, we examine the degree to which Stable Diffusion depicts individuals of the same race as being similar to one another. This analysis reveals a high degree of stereotyping, e.g., depicting most Middle Eastern males as dark-skinned, bearded, and wearing a traditional headdress. We address these limitations by proposing yet another novel solution that increases facial diversity across genders and racial groups. Our solutions are open-sourced and made publicly available.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- South America > Brazil (0.04)
- Law Enforcement & Public Safety (0.69)
- Information Technology (0.46)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
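As a rough sketch of the quantification step described in the abstract above, the demographic skew of a generator's outputs for a given prompt can be summarized by classifying each generated face and measuring how far the resulting race distribution deviates from parity. `predict_race` is a hypothetical stand-in for the paper's classifier, and the six-race label set here is illustrative:

```python
# Sketch of bias quantification for one prompt (e.g., "a photo of a doctor"):
# classify generated faces, then score the deviation from a balanced mix.
from collections import Counter

RACES = ["East Asian", "Indian", "Black", "White",
         "Middle Eastern", "Latino Hispanic"]

def race_distribution(images, predict_race):
    # predict_race is a hypothetical classifier: image -> race label
    counts = Counter(predict_race(img) for img in images)
    n = sum(counts.values())
    return {r: counts.get(r, 0) / n for r in RACES}

def total_variation_from_parity(dist):
    # 0.0 = perfectly balanced across the races, approaching 1.0 = maximal skew
    parity = 1.0 / len(RACES)
    return 0.5 * sum(abs(p - parity) for p in dist.values())

# usage (assuming images generated for one profession prompt):
# dist = race_distribution(doctor_images, predict_race)
# print(total_variation_from_parity(dist))
```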
Benchmarking the Fairness of Image Upsampling Methods
Laszkiewicz, Mike, Daunhawer, Imant, Vogt, Julia E., Fischer, Asja, Lederer, Johannes
Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
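One simple instance of a "statistically fair" check in the spirit of this benchmark: a goodness-of-fit test of the predicted-race proportions of the upsampled outputs against a reference distribution (uniform over FairFace's seven race labels here). The function names and the choice of a chi-squared test are illustrative assumptions, not necessarily the paper's exact metrics:

```python
# Test whether the race distribution of upsampled faces is consistent with a
# uniform reference distribution; predicted_races would come from a
# face-attribute classifier applied to the model's outputs.
import numpy as np
from scipy.stats import chisquare

LABELS = ["East Asian", "Southeast Asian", "Indian", "Black",
          "White", "Middle Eastern", "Latino Hispanic"]

def statistical_fairness_test(predicted_races, labels=LABELS, alpha=0.05):
    observed = np.array([predicted_races.count(r) for r in labels])
    expected = np.full(len(labels), observed.sum() / len(labels))
    stat, p_value = chisquare(observed, expected)
    # failing to reject the null means the output is consistent with fairness
    return p_value, p_value >= alpha

# usage: p, consistent = statistical_fairness_test(races_of_upsampled_faces)
```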
Linking convolutional kernel size to generalization bias in face analysis CNNs
Liang, Hao, Caro, Josue Ortega, Maheshri, Vikram, Patel, Ankit B., Balakrishnan, Guha
Training dataset biases are by far the most scrutinized factors when explaining algorithmic biases of neural networks. In contrast, hyperparameters related to the neural network architecture have largely been ignored, even though different network parameterizations are known to induce different implicit biases over learned features. For example, convolutional kernel size is known to affect the frequency content of features learned in CNNs. In this work, we present a causal framework for linking an architectural hyperparameter to out-of-distribution algorithmic bias. Our framework is experimental, in that we train several versions of a network with an intervention to a specific hyperparameter and measure the resulting causal effect of this choice on performance bias when a particular out-of-distribution image perturbation is applied. In our experiments, we focus on measuring the causal relationship between convolutional kernel size and face analysis classification bias across different subpopulations (race/gender) with respect to high-frequency image details. We show that modifying kernel size, even in one layer of a CNN, significantly changes the frequency content of learned features across data subgroups, leading to biased generalization performance even in the presence of a balanced dataset.
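The interventional protocol lends itself to a compact sketch: train otherwise-identical CNNs that differ only in one layer's kernel size, then compare per-subgroup accuracy under a high-frequency perturbation. The architecture, the crude noise model, and the `train` helper below are illustrative assumptions, not the paper's exact setup:

```python
# Intervene on one hyperparameter (first-layer kernel size), hold everything
# else fixed, and measure per-subgroup accuracy under high-frequency noise.
import torch
import torch.nn as nn

def make_cnn(kernel_size, n_classes=2):
    pad = kernel_size // 2
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size, padding=pad), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes),
    )

def add_high_freq_noise(x, strength=0.1):
    # crude high-frequency perturbation: noise minus its local average
    noise = torch.randn_like(x)
    low = nn.functional.avg_pool2d(noise, 5, stride=1, padding=2)
    return x + strength * (noise - low)

@torch.no_grad()
def subgroup_accuracy(model, x, y, groups):
    pred = model(add_high_freq_noise(x)).argmax(1)
    return {g: (pred[groups == g] == y[groups == g]).float().mean().item()
            for g in groups.unique().tolist()}

# causal effect of kernel size on the subgroup accuracy gap:
# for k in (3, 5, 7, 9):
#     model = train(make_cnn(k), train_loader)  # hypothetical training helper
#     print(k, subgroup_accuracy(model, x_test, y_test, group_labels))
```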
Fairness-Aware Domain Generalization under Covariate and Dependence Shifts
Zhao, Chen, Jiang, Kai, Wu, Xintao, Wang, Haoliang, Khan, Latifur, Grant, Christan, Chen, Feng
While modern fairness-aware machine learning techniques have demonstrated significant success in various applications [1, 2, 3], their primary objective is to facilitate equitable decision-making, ensuring fairness across all demographic groups regardless of sensitive attributes such as race and gender. Nevertheless, state-of-the-art methods can encounter severe shortcomings during the inference phase, mainly due to poor generalization when spurious correlations deviate from the patterns seen in the training data. Such correlations can manifest either between model outcomes and sensitive attributes [4, 5] or between model outcomes and non-semantic data features [6]. The issue originates from the existence of out-of-distribution (OOD) data and can result in catastrophic failures. Over the past decade, the machine learning community has made significant strides in studying the OOD generalization (or domain generalization, DG) problem, attributing poor generalization to distribution shifts from source domains to target domains. There are two dominant shift types [7]: concept shift and covariate shift. Concept shift refers to OOD samples drawn from a distribution with semantic change, e.g., dog vs.
- North America > United States > Texas > Dallas County > Richardson (0.14)
- North America > United States > New York (0.04)
- North America > United States > Texas > McLennan County > Waco (0.04)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.48)
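For reference, the two shift types named in the excerpt are standardly formalized as follows. This is the usual DG-literature notation (source domain S, target domain T, inputs X, labels Y), not necessarily the paper's own, and the dependence shift from the title is not defined in the excerpt:

```latex
% Covariate shift: the input marginal changes while the labeling rule is fixed.
P_S(X) \neq P_T(X), \qquad P_S(Y \mid X) = P_T(Y \mid X)

% Concept shift: the labeling rule itself changes (semantic change).
P_S(Y \mid X) \neq P_T(Y \mid X)
```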
Balancing Act: Constraining Disparate Impact in Sparse Models
Hashemizadeh, Meraj, Ramirez, Juan, Sukumaran, Rohan, Farnadi, Golnoosh, Lacoste-Julien, Simon, Gallego-Posada, Jose
Model pruning is a popular approach to enable the deployment of large deep learning models on edge devices with restricted computational or storage capacities. Although sparse models achieve performance comparable to that of their dense counterparts at the level of the entire dataset, they exhibit high accuracy drops for some data sub-groups. Existing methods to mitigate this disparate impact induced by pruning (i) rely on surrogate metrics that address the problem indirectly and have limited interpretability; or (ii) scale poorly with the number of protected sub-groups in terms of computational cost. We propose a constrained optimization approach that directly addresses the disparate impact of pruning: our formulation bounds the accuracy change between the dense and sparse models, for each sub-group. This choice of constraints provides an interpretable success criterion to determine if a pruned model achieves acceptable disparity levels. Experimental results demonstrate that our technique scales reliably to problems involving large models and hundreds of protected sub-groups.

Current deep learning practice displays a trend towards larger architectures (Bommasani et al., 2021), as exemplified by popular models such as GPT-4 (OpenAI, 2023), Llama 2 (Touvron et al., 2023) and DALL-E 2 (Ramesh et al., 2022). Model compression techniques such as pruning (Gale et al., 2019), knowledge distillation (Hinton et al., 2015), or quantization (Gholami et al., 2021) are crucial towards enabling the deployment of large models across a wide range of platforms, including resource-constrained edge devices like smartphones.

Despite achieving comparable performance at an aggregate level over the entire dataset, pruned models often exhibit significant accuracy reduction for some data sub-groups (Hooker et al., 2019; 2020; Paganini, 2020). In particular, under-represented groups can suffer high performance degradation while the overall performance remains unaffected, thus exacerbating systemic biases in machine learning models. Tran et al. (2022) refer to this phenomenon as the disparate impact of pruning.

Existing mitigation methods face challenges in terms of interpretability and scalability to a large number of sub-groups. Tran et al. (2022) introduce constraints aiming to equalize the loss of the sparse model across sub-groups. However, their approach does not account for the unequal group-level performance of the dense model. Moreover, while the loss can be a useful surrogate for training, this method addresses the disparate impact issue indirectly as it focuses on controlling the loss, rather than group-level changes in accuracy. Alternatively, Lin et al. (2022) compute per-group importance scores for every model parameter to determine the weights to be pruned. This approach becomes prohibitively expensive when the model or the number of sub-groups is large.
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > Quebec > Montreal (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
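A minimal sketch of the constrained formulation described above: bound each sub-group's degradation relative to the dense model and enforce the bounds with per-group Lagrange multipliers, descending on the model parameters while ascending on the multipliers. Because accuracy is non-differentiable, the per-group loss gap is used below as a surrogate constraint; the paper's exact surrogate, tolerance, and update rules are assumptions in this sketch:

```python
# One min-max step: minimize the Lagrangian over the sparse model's weights,
# then take a projected gradient-ascent step on the per-group multipliers.
import torch

def constrained_step(sparse_model, batches_by_group, dense_loss_by_group,
                     multipliers, opt_model, eps=0.01, lambda_lr=0.1):
    # batches_by_group: {group: (x, y)}; dense_loss_by_group: {group: scalar}
    # multipliers: {group: non-negative scalar tensor, no grad}
    loss_fn = torch.nn.functional.cross_entropy
    lagrangian = 0.0
    violations = {}
    for g, (x, y) in batches_by_group.items():
        sparse_loss = loss_fn(sparse_model(x), y)
        # constraint: loss_sparse_g - loss_dense_g <= eps for every group g
        violations[g] = sparse_loss - dense_loss_by_group[g] - eps
        lagrangian = lagrangian + sparse_loss + multipliers[g] * violations[g]
    opt_model.zero_grad()
    lagrangian.backward()
    opt_model.step()  # descent on the model parameters
    with torch.no_grad():  # ascent on the multipliers, projected onto >= 0
        for g, v in violations.items():
            multipliers[g] = torch.clamp(multipliers[g] + lambda_lr * v,
                                         min=0.0)
    return {g: v.item() for g, v in violations.items()}
```

A multiplier grows only while its group's constraint is violated, so groups that degrade the most under pruning automatically receive the most corrective pressure.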
Introducing Construct Theory as a Standard Methodology for Inclusive AI Models
Raj, Susanna, Jamthe, Sudha, Viswanath, Yashaswini, Lokiah, Suresh
Construct theory in social psychology, developed by George Kelly, holds that people use mental constructs to predict and anticipate events. Constructs are how humans interpret, curate, predict, and validate data and information. AI today is biased because it is trained with a narrow construct, as defined by the training data labels. Machine learning algorithms for facial recognition discriminate against darker skin tones, and the groundbreaking research paper Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification (Buolamwini and Gebru, FAT* 2018) proposes the inclusion of phenotypic labeling as a viable solution. In construct theory, phenotype is just one of the many sub-elements that make up the construct of a face. In this paper, we present 15 main elements of the construct of a face, with 50 sub-elements, and test the Google Cloud Vision API and the Microsoft Cognitive Services API using the FairFace dataset, which currently has labels for 7 races along with gender and age; we then retest against the FairFace Plus dataset curated by us. Our results show exactly where these APIs have gaps in inclusivity. Based on our experimental results, we propose that validated, inclusive constructs become the industry standard for AI/ML models going forward.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia (0.04)
- Information Technology > Services (0.50)
- Law (0.47)
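A sketch of the kind of per-group audit the paper runs against commercial APIs: aggregate an API's face/attribute detections over FairFace images by race so that inclusivity gaps become visible. `detect_attributes` is a hypothetical wrapper around whichever API (Google Cloud Vision, Microsoft Cognitive Services) is being tested; it is not either vendor's actual interface:

```python
# Compute per-race detection rates over a labeled face dataset, so gaps in an
# API's coverage of different groups become directly comparable.
from collections import defaultdict

def per_group_detection_rate(samples, detect_attributes):
    # samples: iterable of (image, race_label) pairs, e.g., from FairFace
    hits, totals = defaultdict(int), defaultdict(int)
    for image, race in samples:
        totals[race] += 1
        if detect_attributes(image):  # truthy iff the API detected a face
            hits[race] += 1
    return {race: hits[race] / totals[race] for race in totals}

# usage:
# rates = per_group_detection_rate(fairface_samples, detect_attributes)
# gap = max(rates.values()) - min(rates.values())  # inclusivity gap
```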
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Berg, Hugo, Hall, Siobhan Mackenzie, Bhalgat, Yash, Yang, Wonsuk, Kirk, Hannah Rose, Shtedritski, Aleksandar, Bain, Max
Vision-language models can encode societal biases and stereotypes, but measuring and mitigating these multimodal harms is challenging due to a lack of robust bias measures and the risk of degrading the learned representations. To address these challenges, we investigate bias measures and apply ranking metrics for image-text representations. We then investigate debiasing methods and show that prepending learned embeddings to text queries, jointly trained with adversarial debiasing and a contrastive loss, reduces various bias measures with minimal degradation of the image-text representation.
- Africa > Eswatini > Manzini > Manzini (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
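The core mechanism admits a short sketch: a small array of learnable prompt embeddings is prepended to each text query's token embeddings and trained with (i) a contrastive image-text loss and (ii) an adversary that predicts a protected attribute from the text features. One common way to implement adversarial debiasing is a gradient-reversal layer, used below; the stand-in modules and hyperparameters are assumptions, and the paper works on top of a CLIP model:

```python
# Learnable prompt array prepended to text tokens, trained jointly with a
# contrastive image-text loss and a gradient-reversed adversarial loss.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad  # flip gradients so the text features shed attribute info

class PromptArray(nn.Module):
    def __init__(self, n_prompts=8, dim=512):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, token_embeds):  # token_embeds: (batch, seq, dim)
        batch = token_embeds.shape[0]
        p = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, token_embeds], dim=1)  # (batch, n+seq, dim)

def losses(img_feat, txt_feat, attr_labels, adversary, temp=0.07):
    # contrastive loss: matched image-text pairs sit on the diagonal
    img_feat = nn.functional.normalize(img_feat, dim=-1)
    txt_feat = nn.functional.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temp
    targets = torch.arange(len(logits))
    contrastive = nn.functional.cross_entropy(logits, targets)
    # adversary trains normally; reversed gradients debias the text features
    adv_logits = adversary(GradReverse.apply(txt_feat))
    adversarial = nn.functional.cross_entropy(adv_logits, attr_labels)
    return contrastive + adversarial
```

Only the prompt array (and the adversary) need gradient updates, which is consistent with the abstract's emphasis on minimal degradation of the frozen image-text representation.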