Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

Neural Information Processing Systems

We systematically study a wide variety of generative models spanning semantically-diverse image datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing to 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization: none in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage.
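
For context, FID and its relatives reduce to a Fréchet (2-Wasserstein) distance between Gaussians fit to features from some extractor, and the extractor is a free choice: classic FID uses Inception-V3 pool features, while the paper argues for DINOv2-ViT-L/14. Below is a minimal sketch of the distance itself, assuming features have already been extracted into NumPy arrays; the function name is illustrative, not from the paper's code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fit to two feature sets.

    feats_real, feats_gen: (N, D) arrays of features from any
    extractor (Inception-V3 for classic FID, or e.g. DINOv2).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; drop the tiny
    # imaginary component that numerical error can introduce.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)
```

Because nothing in this computation is specific to Inception-V3, swapping in a different feature extractor changes the metric's behaviour, which is precisely the lever the paper pulls.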


A Distributionally Robust Optimisation Approach to Fair Credit Scoring

Casas, Pablo, Mues, Christophe, Yu, Huan

arXiv.org Artificial Intelligence

Credit scoring has been catalogued by the European Commission and the Executive Office of the US President as a high-risk classification task, a key concern being the potential harms of making loan approval decisions based on models that would be biased against certain groups. To address this concern, recent credit scoring research has considered a range of fairness-enhancing techniques put forward by the machine learning community to reduce bias and unfair treatment in classification systems. While the definition of fairness and the approach taken to impose it may vary, most of these techniques disregard the robustness of the results. This can create situations where unfair treatment is effectively corrected in the training set, but when producing out-of-sample classifications, unfair treatment is incurred again. Instead, in this paper, we investigate how to apply Distributionally Robust Optimisation (DRO) methods to credit scoring, empirically evaluating how they perform in terms of fairness, ability to classify correctly, and robustness of the solution against changes in the marginal proportions. In so doing, we find DRO methods to provide a substantial improvement in terms of fairness, with almost no loss in performance. These results thus indicate that DRO can improve fairness in credit scoring, provided that further advances are made in efficiently implementing these systems. In addition, our analysis suggests that many commonly used fairness metrics are unsuitable for a credit scoring setting, as they depend on the choice of classification threshold.
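
The closing observation about threshold dependence is easy to make concrete. The sketch below, on synthetic data with illustrative names, shows that the demographic-parity gap of a single fixed scorecard varies with the approval cut-off a lender chooses; it is not taken from the paper's experiments.

```python
import numpy as np

def demographic_parity_gap(scores, group, threshold):
    """Absolute difference in approval rates between two groups
    when applicants with score >= threshold are approved.
    `group` is a boolean array marking membership in one group."""
    approve = scores >= threshold
    return abs(approve[group].mean() - approve[~group].mean())

# Synthetic scores: the same model looks more or less "fair"
# depending purely on where the classification threshold sits.
rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=1000)
group = rng.random(1000) < 0.3
for t in (0.2, 0.4, 0.6):
    print(f"threshold={t}: gap={demographic_parity_gap(scores, group, t):.3f}")
```

This is one reason threshold-dependent fairness metrics sit awkwardly with credit scoring, where cut-offs are routinely moved to manage portfolio risk.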


Study finds AI systems exhibit human-like prejudices

#artificialintelligence

Whether we like to believe it or not, scientific research has clearly shown that we all have deeply ingrained biases, which create stereotypes in our minds that can often lead to unfair treatment of others. As artificial intelligence (AI) plays an increasingly important role in our lives as a decision maker in self-driving cars, doctors' offices, and surveillance, it becomes critical to ask whether AI exhibits the same inbuilt biases as humans. According to a new study conducted by a team of researchers at Princeton, many AI systems do in fact exhibit racial and gender biases that could prove problematic in some cases. One well-established way for psychologists to detect biases is the Implicit Association Test. Introduced into the scientific literature in 1998 and widely used today in clinical, cognitive, and developmental research, the test is designed to measure the strength of a person's automatic association between concepts or objects in memory.
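
The Princeton team's published analogue of the Implicit Association Test for machines is the Word Embedding Association Test (WEAT), which replaces human reaction times with cosine similarity between word vectors. Here is a minimal sketch of its core effect-size computation; the word vectors (e.g. from GloVe or word2vec) are assumed to be supplied as NumPy arrays, and the function names are illustrative.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """s(w, A, B): how much more strongly word vector `w` associates
    with attribute set A (e.g. 'pleasant' words) than with attribute
    set B (e.g. 'unpleasant' words)."""
    return (np.mean([cosine(w, a) for a in A])
            - np.mean([cosine(w, b) for b in B]))

def weat_effect_size(X, Y, A, B):
    """Normalised difference in mean association between two target
    sets X and Y (e.g. European-American vs. African-American names),
    pooled over the standard deviation across X and Y."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy)
```

A large positive effect size means the embedding, like a human IAT subject, pairs one target group with the pleasant attributes far more readily than the other.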