Jesus, Sérgio
Fair-OBNC: Correcting Label Noise for Fairer Datasets
Silva, Inês Oliveira e, Jesus, Sérgio, Ferreira, Hugo, Saleiro, Pedro, Sousa, Inês, Bizarro, Pedro, Soares, Carlos
Data used by automated decision-making systems, such as Machine Learning models, often reflects discriminatory behavior that occurred in the past. These biases in the training data are sometimes related to label noise, such as in COMPAS, where more African-American offenders are wrongly labeled as having a higher risk of recidivism than their White counterparts. Models trained on such biased data may perpetuate or even aggravate biases with respect to sensitive information, such as gender, race, or age. However, while multiple label noise correction approaches are available in the literature, they focus exclusively on model performance. In this work, we propose Fair-OBNC, a label noise correction method with fairness considerations that produces training datasets with measurable demographic parity. The method adapts Ordering-Based Noise Correction with an adjusted ordering criterion based both on the margin of error of an ensemble and on the potential increase in the observed demographic parity of the dataset. We evaluate Fair-OBNC against other pre-processing techniques under different scenarios of controlled label noise. Our results show that the proposed method is overall the best alternative within the pool of label correction methods, attaining better reconstructions of the original labels. Across the considered levels of label noise, models trained on the corrected data show an average increase of 150% in demographic parity compared to models trained on data with noisy labels.
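Below is a minimal sketch of the core idea, assuming binary labels and a binary sensitive attribute: rank instances by an ensemble's margin of error and prefer flips that also move the two groups' positive-label rates closer together. The ensemble choice, the weighting between the two criteria, and the flip budget are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' implementation): flip the labels
# the ensemble disagrees with most, preferring flips that also reduce the gap
# in positive-label rates between the two groups.
import numpy as np
from sklearn.ensemble import BaggingClassifier

def fair_obnc_correct(X, y, group, n_flips, fairness_weight=0.5, seed=0):
    """y and group are 1-D numpy arrays of 0/1 values; n_flips is the flip budget."""
    ensemble = BaggingClassifier(n_estimators=25, random_state=seed).fit(X, y)
    votes = np.mean([m.predict(X) for m in ensemble.estimators_], axis=0)
    # Margin of error: how strongly the ensemble votes against the observed label.
    margin = np.where(y == 1, 1.0 - votes, votes)

    y_corr = y.copy()
    for _ in range(n_flips):
        rates = [y_corr[group == g].mean() for g in (0, 1)]
        favoured = int(np.argmax(rates))  # group with the higher positive-label rate
        # Candidate flips that bring the two positive-label rates closer together.
        helps_parity = (((group == favoured) & (y_corr == 1)) |
                        ((group != favoured) & (y_corr == 0))).astype(float)
        score = (1.0 - fairness_weight) * margin + fairness_weight * helps_parity
        score[y_corr != y] = -np.inf  # flip each label at most once
        i = int(np.argmax(score))
        y_corr[i] = 1 - y_corr[i]
    return y_corr
```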
Aequitas Flow: Streamlining Fair ML Experimentation
Jesus, Sérgio, Saleiro, Pedro, Silva, Inês Oliveira e, Jorge, Beatriz M., Ribeiro, Rita P., Gama, João, Bizarro, Pedro, Ghani, Rayid
Aequitas Flow is an open-source framework for end-to-end Fair Machine Learning (ML) experimentation in Python. The package fills integration gaps left by other Fair ML packages, which lack complete and accessible experimentation. It provides a pipeline for fairness-aware model training, hyperparameter optimization, and evaluation, enabling rapid and simple experiments and result analysis. Aimed at ML practitioners and researchers, the framework offers implementations of methods, datasets, and metrics, along with standard interfaces for these components to improve extensibility. By facilitating the development of fair ML practices, Aequitas Flow seeks to enhance the adoption of these concepts in AI technologies.
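As an illustration of the kind of experiment the framework streamlines, the sketch below trains a few models and reports a performance metric alongside a demographic parity ratio. It deliberately uses plain scikit-learn rather than the Aequitas Flow API, whose exact interfaces are not reproduced here.

```python
# Illustrative only: a fairness-aware experiment loop of the kind the framework
# streamlines, written with plain scikit-learn rather than the Aequitas Flow API.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

def demographic_parity_ratio(y_pred, group):
    # Ratio of positive prediction rates across groups; 1.0 means parity.
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return min(rates) / max(rates)

def run_experiment(X, y, group, models):
    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
        X, y, group, test_size=0.3, stratify=y, random_state=42)
    results = {}
    for name, model in models.items():
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {"recall": recall_score(y_te, y_pred),
                         "demographic_parity": demographic_parity_ratio(y_pred, g_te)}
    return results

# Example usage:
# results = run_experiment(X, y, group, {
#     "logreg": LogisticRegression(max_iter=1000),
#     "forest": RandomForestClassifier(n_estimators=200),
# })
```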
Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints
Alves, Jean V., Leitão, Diogo, Jesus, Sérgio, Sampaio, Marco O. P., Liébana, Javier, Saleiro, Pedro, Figueiredo, Mário A. T., Bizarro, Pedro
Learning to defer (L2D) aims to improve human-AI collaboration systems by learning how to defer decisions to humans when they are more likely to be correct than an ML classifier. Existing research in L2D overlooks key aspects of real-world systems that impede its practical adoption, namely: i) neglecting cost-sensitive scenarios, where type I and type II errors have different costs; ii) requiring concurrent human predictions for every instance of the training dataset; and iii) not dealing with human work-capacity constraints. To address these issues, we propose the deferral under cost and capacity constraints framework (DeCCaF), a novel L2D approach that employs supervised learning to model the probability of human error under less restrictive data requirements (only one expert prediction per instance) and uses constraint programming to globally minimize the error cost subject to workload limitations. We test DeCCaF in a series of cost-sensitive fraud detection scenarios with different teams of 9 synthetic fraud analysts, each with individual work-capacity constraints. The results demonstrate that our approach performs significantly better than the baselines in a wide array of scenarios, achieving an average 8.4% reduction in misclassification cost.
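The assignment step can be pictured as the small optimization sketch below: given a matrix of expected error costs (one column per decision-maker, produced by the human-error model and the cost-sensitive error weights), route each case to exactly one decision-maker without exceeding any workload capacity. The formulation and the SciPy-based solver are illustrative assumptions, not DeCCaF's actual interface.

```python
# Hedged sketch of the assignment step: route each case to one decision-maker so
# that total expected error cost is minimal and no workload capacity is exceeded.
import numpy as np
from scipy.optimize import linprog

def assign_cases(cost, capacity):
    """cost: (n_cases, n_deciders) expected misclassification costs.
    capacity: (n_deciders,) maximum caseload per decision-maker.
    Total capacity must cover all cases for the problem to be feasible."""
    n, m = cost.shape
    c = cost.ravel()                      # one variable per (case, decider) pair
    A_eq = np.zeros((n, n * m))           # each case handled by exactly one decider
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    A_ub = np.zeros((m, n * m))           # each decider stays within capacity
    for j in range(m):
        A_ub[j, j::m] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=np.ones(n),
                  bounds=(0, 1), method="highs")
    # The transportation structure of the constraints yields an integral optimum,
    # so picking the largest entry per row recovers the assignment.
    return res.x.reshape(n, m).argmax(axis=1)
```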
FiFAR: A Fraud Detection Dataset for Learning to Defer
Alves, Jean V., Leitão, Diogo, Jesus, Sérgio, Sampaio, Marco O. P., Saleiro, Pedro, Figueiredo, Mário A. T., Bizarro, Pedro
Public dataset limitations have significantly hindered the development and benchmarking of learning to defer (L2D) algorithms, which aim to optimally combine human and AI capabilities in hybrid decision-making systems. In such systems, human availability and domain-specific concerns introduce difficulties, while obtaining human predictions for training and evaluation is costly. Financial fraud detection is a high-stakes setting where algorithms and human experts often work in tandem; however, there are no publicly available datasets for L2D concerning this important application of human-AI teaming. To fill this gap in L2D research, we introduce the Financial Fraud Alert Review Dataset (FiFAR), a synthetic bank account fraud detection dataset containing the predictions of a team of 50 complex synthetic fraud analysts with varied biases and feature dependence. We also provide a realistic definition of human work-capacity constraints, an aspect of L2D systems that is often overlooked, allowing for extensive testing of assignment systems under real-world conditions. We use our dataset to develop a capacity-aware L2D method and a rejection learning approach under realistic data availability conditions, and benchmark these baselines across 300 distinct testing scenarios. We believe that this dataset will serve as a pivotal instrument in facilitating systematic, rigorous, reproducible, and transparent evaluation and comparison of L2D methods, thereby fostering the development of more synergistic human-AI collaboration in decision-making systems. The public dataset and detailed synthetic expert information are available at: https://github.com/feedzai/fifar-dataset
A Case Study on Designing Evaluations of ML Explanations with Simulated User Studies
Martin, Ada, Chen, Valerie, Jesus, Sérgio, Saleiro, Pedro
When conducting user studies to ascertain the usefulness of model explanations in aiding human decision-making, it is important to use real-world use cases, data, and users. However, this process can be resource-intensive, allowing only a limited number of explanation methods to be evaluated. Simulated user evaluations (SimEvals), which use machine learning models as a proxy for human users, have been proposed as an intermediate step to select promising explanation methods. In this work, we conduct the first SimEvals on a real-world use case to evaluate whether explanations can better support ML-assisted decision-making in e-commerce fraud detection. We study whether SimEvals can corroborate findings from a user study conducted in this fraud detection context. In particular, we find that SimEvals suggest that all considered explainers are equally performant, and none beat a baseline without explanations -- this matches the conclusions of the original user study. Such correspondences between our results and the original user study provide initial evidence in favor of using SimEvals before running user studies. We also explore the use of SimEvals as a cheap proxy to explore an alternative user study set-up. We hope that this work motivates further study of when and how SimEvals should be used to aid in the design of real-world evaluations.
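A rough sketch of the SimEval idea, as applied here, is shown below: train a proxy "user" model on exactly the information a human reviewer would see under a given study arm and measure how well it predicts the ground-truth decision. The arm names, feature layout, and classifier are assumptions for illustration.

```python
# Rough sketch of a SimEval: a proxy "user" model sees only the information a
# human reviewer would see in a given study arm and tries to predict the
# ground-truth decision.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def simeval_score(X, model_scores, explanations, y_true, arm):
    """arm: 'data', 'data+score', or 'data+score+explanation'."""
    blocks = [X]
    if arm in ("data+score", "data+score+explanation"):
        blocks.append(model_scores.reshape(-1, 1))
    if arm == "data+score+explanation":
        blocks.append(explanations)       # e.g. per-feature attribution values
    Z = np.hstack(blocks)
    proxy_user = GradientBoostingClassifier()
    return cross_val_score(proxy_user, Z, y_true, cv=5, scoring="accuracy").mean()

# Comparing this score across arms (and across explainers) mirrors the
# comparison the user study makes with human analysts.
```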
FairGBM: Gradient Boosting with Fairness Constraints
Cruz, André F, Belém, Catarina, Jesus, Sérgio, Bravo, João, Saleiro, Pedro, Bizarro, Pedro
Tabular data is prevalent in many high-stakes domains, such as financial services or public policy. Gradient Boosted Decision Trees (GBDT) are popular in these settings due to their scalability, performance, and low training cost. While fairness in these domains is a foremost concern, existing in-processing Fair ML methods are either incompatible with GBDT or incur significant performance losses while taking considerably longer to train. We present FairGBM, a dual ascent learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained GBDT. Since observational fairness metrics are non-differentiable, we propose smooth convex error-rate proxies for common fairness criteria, enabling gradient-based optimization via a "proxy-Lagrangian" formulation. Our implementation shows an order-of-magnitude speedup in training time relative to related work, a pivotal aspect for fostering the widespread adoption of FairGBM by real-world practitioners.
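The sketch below illustrates the smooth-proxy idea in isolation: the step function inside a group-wise false positive rate is replaced by a differentiable upper bound (a scaled logistic/cross-entropy term), yielding a constraint that gradient-based training can handle. It is a conceptual illustration, not FairGBM's internal implementation.

```python
# Conceptual illustration of a smooth fairness proxy (not FairGBM's internals):
# log2(1 + e^s) upper-bounds the indicator 1[s >= 0], giving a differentiable
# surrogate for a group's false positive rate computed from raw scores s.
import numpy as np

def fpr_proxy(scores, y, group, g):
    """Smooth surrogate for the false positive rate of group g (binary labels)."""
    neg = (y == 0) & (group == g)
    return np.mean(np.logaddexp(0.0, scores[neg])) / np.log(2.0)

def fairness_constraint(scores, y, group):
    """Proxy for |FPR(group 0) - FPR(group 1)|, to be kept below a tolerance."""
    return abs(fpr_proxy(scores, y, group, 0) - fpr_proxy(scores, y, group, 1))

# In a dual-ascent ("proxy-Lagrangian") scheme, the booster minimizes
#   loss + lambda * fairness_constraint
# while lambda is increased whenever the original, non-smooth constraint is violated.
```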
On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods
Amarasinghe, Kasun, Rodolfa, Kit T., Jesus, Sérgio, Chen, Valerie, Balayan, Vladimir, Saleiro, Pedro, Bizarro, Pedro, Talwalkar, Ameet, Ghani, Rayid
Evaluation studies frequently rely on simplified experimental settings with non-expert users (e.g., workers on Amazon Mechanical Turk), use proxy tasks (e.g., forward simulation), or use subjective, user-reported measures as metrics of explanation quality [9, 16, 18, 19, 25, 26, 31]. Such settings are not equipped to evaluate the real-world utility of explainable ML methods, since proxy task performance does not reflect real-task performance [3], users' perception of explanation usefulness does not reflect utility in a task [3, 17], and proxy users do not reflect how expert users would use explanations [1]. A few studies evaluate explainable ML methods in their intended deployment settings, where domain expert users perform the intended task [10, 20] (dubbed application-grounded evaluation studies in [6]). However, even in those, we argue that experimental design flaws (e.g., not isolating the incremental impact of explanations in [20]) and seemingly trivial design choices that cause the experimental setting to deviate from the deployment context (e.g., using metrics that do not reflect the task objectives in [10]) limit the applicability of the conclusions drawn. We elaborate on these limitations in Section 2. In this work, we seek to bridge this critical gap by conducting a study that evaluates explainable ML methods in a setting consistent with the intended deployment context. Our study builds on the e-commerce fraud detection setting used in a previous evaluation study [10], in which professional fraud analysts review e-commerce transactions to detect fraud when the ML model is uncertain about the outcome. We identify several simplifying assumptions made by the previous study that deviated from the deployment context and modify the setup to relax those assumptions (summarized in Table 1 and discussed in detail in Section 3.2). These modifications make the experimental setup faithful to the deployment setting and equipped to evaluate the utility of the explainable ML methods considered. Our setup results in dramatically different conclusions about the relative utility of ML model scores and explanations compared to the earlier work [10].
Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation
Jesus, Sérgio, Pombal, José, Alves, Duarte, Cruz, André, Saleiro, Pedro, Ribeiro, Rita P., Gama, João, Bizarro, Pedro
Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase in publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques to an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both the performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
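A short sketch of the kind of inspection the variants enable is shown below: checking the overall class imbalance and how label prevalence differs across a protected group. The file name, column names, and group split are assumptions; the dataset's documentation gives the actual schema.

```python
# The file name, column names, and group split below are assumptions; see the
# dataset documentation for the actual schema.
import pandas as pd

df = pd.read_csv("baf_variant.csv")            # one of the BAF variants (assumed name)
label, group = "fraud_bool", "customer_age"    # assumed column names

print("fraud prevalence:", df[label].mean())   # expect strong class imbalance
older = df[group] >= 50                        # assumed group split
print("prevalence (older):  ", df.loc[older, label].mean())
print("prevalence (younger):", df.loc[~older, label].mean())
```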
How can I choose an explainer? An Application-grounded Evaluation of Post-hoc Explanations
Jesus, Sérgio, Belém, Catarina, Balayan, Vladimir, Bento, João, Saleiro, Pedro, Bizarro, Pedro, Gama, João
There have been several research works proposing new Explainable AI (XAI) methods designed to generate model explanations with specific properties, or desiderata, such as fidelity, robustness, or human-interpretability. However, explanations are seldom evaluated based on their true practical impact on decision-making tasks. Without that assessment, explanations might be chosen that, in fact, hurt the overall performance of the combined system of ML model + end-users. This study aims to bridge this gap by proposing XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of providing the end-user with different levels of information. We conducted an experiment following XAI Test to evaluate three popular post-hoc explanation methods -- LIME, SHAP, and TreeInterpreter -- on a real-world fraud detection task, with real data, a deployed ML model, and fraud analysts. During the experiment, we gradually increased the information provided to the fraud analysts in three stages: Data Only, i.e., just transaction data without access to the model score or explanations; Data + ML Model Score; and Data + ML Model Score + Explanations. Using rigorous statistical analysis, we show that, in general, these popular explainers have a worse impact than desired. Highlights of our conclusions include: i) the Data Only setting yields the highest decision accuracy and the slowest decision time among all variants tested; ii) all the explainers improve accuracy over the Data + ML Model Score variant but still result in lower accuracy than Data Only; iii) LIME was the least preferred by users, probably due to its substantially lower variability of explanations from case to case.
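For illustration, one comparison in such a study could be run as in the sketch below: a two-proportion test of decision accuracy between two information variants. The counts are placeholders, and the test shown is not necessarily the exact analysis used in the paper.

```python
# Placeholder counts; not the paper's actual numbers or necessarily its exact test.
from scipy.stats import chi2_contingency

correct_a, total_a = 410, 500   # e.g. Data Only
correct_b, total_b = 378, 500   # e.g. Data + ML Model Score + Explanations

table = [[correct_a, total_a - correct_a],
         [correct_b, total_b - correct_b]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"accuracy A={correct_a/total_a:.3f}, B={correct_b/total_b:.3f}, p={p_value:.4f}")
```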