The Impact of Data Preparation on the Fairness of Software Systems
Valentim, Inês, Lourenço, Nuno, Antunes, Nuno
arXiv.org Artificial Intelligence
Machine learning models are widely adopted in scenarios that directly affect people. The development of software systems based on these models raises societal and legal concerns, as their decisions may lead to the unfair treatment of individuals based on attributes like race or gender. Data preparation is key in any machine learning pipeline, but its effect on fairness is yet to be studied in detail. In this paper, we evaluate how the fairness and effectiveness of the learned models are affected by the removal of the sensitive attribute, the encoding of the categorical attributes, and instance selection methods (including cross-validators and random undersampling). We used the Adult Income and the German Credit Data datasets, which are widely studied and known to have fairness concerns. We applied each data preparation technique individually to analyse the difference in predictive performance and fairness, using statistical parity difference, disparate impact, and the normalised prejudice index. The results show that fairness is affected by transformations made to the training data, particularly in imbalanced datasets. Removing the sensitive attribute is insufficient to eliminate all the unfairness in the predictions, as expected, but it is key to achieving fairer models. Additionally, the standard random undersampling with respect to the true labels is sometimes more prejudicial than performing no random undersampling.
Software systems based on machine learning (ML) are being used at an increasing rate and in a multitude of scenarios that have a significant impact on people's lives. Their ubiquity raises several legal and societal concerns, as decisions based on the output of ML models may introduce or perpetuate historical bias against some individuals, based on their intrinsic characteristics, such as race, gender or age.
The use of automated decision-making systems is often appealing due to the gains associated with it, and might even be perceived as a step towards the eradication of personal bias from the process. Nevertheless, the careless adoption of decisions supported by these systems carries many risks. In this context, fairness emerges as a key property in terms of the reliability and trustworthiness of software systems based on ML. Such systems now receive increased attention from regulatory institutions, with the recently approved European Union General Data Protection Regulation (GDPR) requiring organisations to handle personal data in a privacy-preserving, fair and transparent manner [1].
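As a rough illustration of two of the fairness metrics named in the abstract, the sketch below computes statistical parity difference and disparate impact on a small made-up set of binary predictions. The excerpt does not give the paper's exact definitions, so the formulas here follow a common convention and should be read as an assumption: both compare the rate of favourable predictions for an unprivileged group against that of a privileged group, as a difference and as a ratio respectively. The function names, the group labels and the toy data are hypothetical, and the normalised prejudice index is not covered.

```python
def positive_rate(predictions, groups, group_value):
    """Fraction of favourable (1) predictions among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group_value]
    if not selected:
        raise ValueError(f"no instances found for group {group_value!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(predictions, groups, unprivileged, privileged):
    """Assumed convention: rate(unprivileged) - rate(privileged); 0 means parity."""
    return (positive_rate(predictions, groups, unprivileged)
            - positive_rate(predictions, groups, privileged))


def disparate_impact(predictions, groups, unprivileged, privileged):
    """Assumed convention: rate(unprivileged) / rate(privileged); 1 means parity."""
    privileged_rate = positive_rate(predictions, groups, privileged)
    if privileged_rate == 0:
        raise ValueError("privileged group has no favourable predictions")
    return positive_rate(predictions, groups, unprivileged) / privileged_rate


# Made-up toy data: 1 is the favourable outcome, "f"/"m" is the sensitive attribute.
toy_predictions = [1, 0, 0, 0, 1, 1, 1, 0]
toy_groups = ["f", "f", "f", "f", "m", "m", "m", "m"]

spd = statistical_parity_difference(toy_predictions, toy_groups, "f", "m")
di = disparate_impact(toy_predictions, toy_groups, "f", "m")
print(f"statistical parity difference: {spd:.3f}")
print(f"disparate impact: {di:.3f}")
```

Under this convention, a negative difference or a ratio below 1 indicates that the unprivileged group receives favourable predictions less often than the privileged group.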
Oct-5-2019