
Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A Test for ABROCA

Borchers, Conrad

arXiv.org Machine Learning

Algorithmic bias is a pressing concern in educational data mining (EDM), as it risks amplifying inequities in learning outcomes. The Area Between ROC Curves (ABROCA) metric is frequently used to measure discrepancies in model performance across demographic groups and thereby quantify overall model fairness. However, its skewed distribution, especially under class or group imbalance, makes significance testing challenging. This study investigates ABROCA's distributional properties and contributes robust methods for its significance testing. Specifically, we address (1) whether ABROCA follows any known distribution, (2) how to reliably test for algorithmic bias using ABROCA, and (3) the statistical power achievable with ABROCA-based bias assessments under typical EDM sample specifications. Simulation results confirm that ABROCA does not match standard distributions, including those suited to accommodate skewness. We propose nonparametric randomization tests for ABROCA and demonstrate that reliably detecting bias with ABROCA requires large sample sizes or substantial effect sizes, particularly in imbalanced settings. Findings suggest that ABROCA-based bias evaluation at sample sizes common in EDM tends to be underpowered, undermining the reliability of conclusions about model fairness. By offering open-source code to simulate power and statistically test ABROCA, this paper aims to foster more reliable statistical testing in EDM research. It supports broader efforts toward replicability and equity in educational modeling.
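
The abstract names two concrete ingredients, the ABROCA statistic and a nonparametric randomization test, that lend themselves to a short sketch. The following is a minimal illustration under common assumptions (two groups, trapezoidal integration over a shared false-positive-rate grid, permutation of group labels as the null); it is not the authors' released code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_score, group):
    """Area between the ROC curves of two groups, on a shared FPR grid."""
    fpr_grid = np.linspace(0.0, 1.0, 1001)
    tprs = []
    for g in (0, 1):
        mask = group == g
        fpr, tpr, _ = roc_curve(y_true[mask], y_score[mask])
        tprs.append(np.interp(fpr_grid, fpr, tpr))
    gap = np.abs(tprs[0] - tprs[1])
    # Trapezoidal rule, written out to avoid NumPy version differences
    return float(np.sum((gap[1:] + gap[:-1]) / 2 * np.diff(fpr_grid)))

def abroca_randomization_test(y_true, y_score, group, n_perm=2000, seed=0):
    """Permutation test: shuffling group labels simulates 'no group effect'."""
    rng = np.random.default_rng(seed)
    observed = abroca(y_true, y_score, group)
    null = np.array([abroca(y_true, y_score, rng.permutation(group))
                     for _ in range(n_perm)])
    # Add-one smoothing keeps the p-value away from an impossible zero
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value
```

Permuting group labels while holding predictions and outcomes fixed matches the null hypothesis that model performance does not depend on group membership; power at a given sample size can then be simulated by repeating the test on synthetic draws.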


An experimental study on fairness-aware machine learning for credit scoring problem

Thu, Huyen Giang Thi, Doan, Thang Viet, Quy, Tai Le

arXiv.org Machine Learning

Digitalization of credit scoring is an essential requirement for financial organizations and commercial banks, especially in the context of digital transformation. Machine learning techniques are commonly used to evaluate customers' creditworthiness. However, the predictions of machine learning models can be biased with respect to protected attributes, such as race or gender. Numerous fairness-aware machine learning models and fairness measures have been proposed. Nevertheless, their performance in the context of credit scoring has not been thoroughly investigated. In this paper, we present a comprehensive experimental study of fairness-aware machine learning in credit scoring. The study explores key aspects of credit scoring, including financial datasets, predictive models, and fairness measures. We also provide a detailed evaluation of fairness-aware predictive models and fairness measures on widely used financial datasets.
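
To make the notion of a "fairness measure" concrete, here is a small sketch of two widely used group fairness measures, statistical parity difference and equal opportunity difference, computed from binary predictions. The definitions follow standard usage rather than the paper's specific implementation, and the toy data are hypothetical.

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """P(y_hat = 1 | group 0) - P(y_hat = 1 | group 1)."""
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true positive rates between the two groups."""
    tpr = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)
        tpr.append(y_pred[mask].mean())
    return tpr[0] - tpr[1]

# Toy example: predictions from a hypothetical credit-approval model,
# with a hypothetical binary protected attribute
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(statistical_parity_difference(y_pred, group))        # -0.25
print(equal_opportunity_difference(y_true, y_pred, group))  # about -0.33
```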


Fair Knowledge Tracing in Second Language Acquisition

Tang, Weitao, Chen, Guanliang, Zu, Shuaishuai, Luo, Jiangyi

arXiv.org Artificial Intelligence

In the domain of second-language acquisition, predictive modeling is a pivotal tool for helping educators implement diversified teaching strategies, and it has therefore attracted extensive research attention. Despite the prevalent focus on model accuracy in most existing studies, model fairness remains substantially underexplored. Model fairness pertains to the equitable treatment of different groups by machine learning algorithms. It ensures that a model's predictions do not exhibit unintentional biases against certain groups based on attributes such as gender, ethnicity, age, or other potentially sensitive characteristics. In essence, a fair model should produce outcomes that are impartial and do not perpetuate existing prejudices, ensuring that no group is systematically disadvantaged. In this research, we evaluate the fairness of two predictive models for second-language learning, utilizing three tracks from the Duolingo dataset: en_es (English learners who speak Spanish), es_en (Spanish learners who speak English), and fr_en (French learners who speak English). We measure (i) algorithmic fairness across client platforms (iOS, Android, and Web) and (ii) algorithmic fairness between developed and developing countries. Our findings indicate: 1) Deep learning exhibits a marked advantage over machine learning when applied to knowledge tracing for second language acquisition, owing to its heightened accuracy and fairness.
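
The abstract does not specify how fairness across groups was quantified; one common, simple approach is to compare model AUC within each group. A hedged sketch, assuming per-group AUC as the fairness signal and group labels such as client platform:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_group_auc(y_true, y_score, groups):
    """AUC of a knowledge-tracing model within each group (e.g., client platform)."""
    return {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}

# Hypothetical usage with Duolingo-style per-attempt predictions:
# aucs = per_group_auc(labels, model_probs, clients)  # clients in {"iOS", "Android", "Web"}
# gap = max(aucs.values()) - min(aucs.values())       # simple unfairness gap
```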


ABROCA Distributions For Algorithmic Bias Assessment: Considerations Around Interpretation

Borchers, Conrad, Baker, Ryan S.

arXiv.org Machine Learning

Algorithmic bias is of critical concern within education, as it could undermine the effectiveness of learning analytics. While different definitions and conceptualizations of algorithmic bias and fairness exist [2], their common denominator typically revolves around systematic unfairness or unequal treatment of groups caused by algorithms. This bias occurs when an algorithm produces results that disproportionately disadvantage or favor particular groups of people based on non-malleable characteristics like race, gender, or socioeconomic status [7]. Recent learning analytics research has argued that although the vast majority of published papers investigating algorithmic bias in education find evidence of bias [2], some predictive models appear to achieve fairness, with minimal difference in model quality across demographic groups. For example, Zambrano et al. [18] evaluated carelessness detectors and Bayesian knowledge tracing models, finding near-equal performance across groups defined by race, gender, socioeconomic status, special needs, and English language learner status. Similarly, Jiang and Pardos [10] compared the accuracies of grade prediction models across ethnic groups, concluding that an adversarial learning approach led to the fairest models, but did not engage with the question of whether their fairest model was sufficiently fair.


Is Your Model "MADD"? A Novel Metric to Evaluate Algorithmic Fairness for Predictive Student Models

Verger, Mélina, Lallé, Sébastien, Bouchet, François, Luengo, Vanda

arXiv.org Artificial Intelligence

Predictive student models are increasingly used in learning environments due to their ability to enhance educational outcomes and support stakeholders in making informed decisions. However, predictive models can be biased and produce unfair outcomes, leading to potential discrimination against some students and possible harmful long-term implications. This has prompted research on fairness metrics meant to capture and quantify such biases. Nonetheless, the fairness metrics used in education so far are predictive performance-oriented, focusing on assessing biased outcomes across groups of students without considering either the behavior of the models or the severity of the biases in the outcomes. Therefore, we propose a novel metric, the Model Absolute Density Distance (MADD), to analyze models' discriminatory behaviors independently of their predictive performance. We also provide a complementary visualization-based analysis to enable fine-grained human assessment of how the models discriminate between groups of students. We evaluate our approach on the common task of predicting student success in online courses, using several common predictive classification models on an open educational dataset. We also compare our metric to ABROCA, the only predictive performance-oriented fairness metric developed in education. Results on this dataset show that: (1) fair predictive performance does not guarantee fair model behavior and thus fair outcomes, (2) there is no direct relationship between data bias and either predictive performance bias or discriminatory behavior bias, and (3) models trained on the same data exhibit different discriminatory behaviors, which also vary across sensitive features. We thus recommend using MADD on models that show satisfactory predictive performance, to gain a finer-grained understanding of how they behave and to refine model selection and usage.
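
Based on the description above, MADD compares the distributions of predicted probabilities that a model assigns to two groups, independently of labels. The following sketch is one plausible reading, assuming normalized histogram densities over [0, 1]; the bin count is a free parameter here, not a value taken from the paper.

```python
import numpy as np

def madd_sketch(scores_g0, scores_g1, n_bins=100):
    """Sketch of a Model Absolute Density Distance style computation:
    compare normalized histograms of predicted probabilities for two groups."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    d0, _ = np.histogram(scores_g0, bins=bins)
    d1, _ = np.histogram(scores_g1, bins=bins)
    d0 = d0 / max(len(scores_g0), 1)  # per-bin proportion of group 0
    d1 = d1 / max(len(scores_g1), 1)  # per-bin proportion of group 1
    return np.abs(d0 - d1).sum()      # 0 = identical densities, 2 = disjoint
```

Under this reading, a value of 0 means the model scores both groups identically and the maximum of 2 means the two score distributions do not overlap at all, which is what makes such a measure independent of predictive performance.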


Evaluation of group fairness measures in student performance prediction problems

Quy, Tai Le, Nguyen, Thi Huyen, Friege, Gunnar, Ntoutsi, Eirini

arXiv.org Artificial Intelligence

Predicting students' academic performance is one of the key tasks of educational data mining (EDM). Traditionally, high forecasting quality was deemed the critical property of such models. More recently, issues of fairness and discrimination with respect to protected attributes, such as gender or race, have gained attention. Although several fairness-aware learning approaches exist in EDM, a comparative evaluation of these measures is still missing. In this paper, we evaluate different group fairness measures for student performance prediction problems on various educational datasets and fairness-aware learning models. Our study shows that the choice of fairness measure is important, as is the choice of the grade threshold.
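
The closing point, that the grade threshold matters as much as the fairness measure, can be illustrated with a short sketch that recomputes a demographic-parity gap at several pass/fail cutoffs. The 0-20 grade scale mirrors common educational datasets such as the UCI student performance data; the simulated grades and thresholds are purely illustrative.

```python
import numpy as np

def parity_across_thresholds(grades, group, thresholds):
    """Demographic-parity gap in 'pass' labels at each grade threshold."""
    gaps = {}
    for t in thresholds:
        passed = (grades >= t).astype(float)
        gaps[t] = passed[group == 0].mean() - passed[group == 1].mean()
    return gaps

# Hypothetical final grades on a 0-20 scale with a hypothetical binary group
rng = np.random.default_rng(1)
grades = np.clip(rng.normal(11, 3, 500), 0, 20)
group = rng.integers(0, 2, 500)
print(parity_across_thresholds(grades, group, thresholds=[9, 10, 12]))
```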


Fair Classification via Transformer Neural Networks: Case Study of an Educational Domain

Sulaiman, Modar, Roy, Kallol

arXiv.org Artificial Intelligence

Educational technologies nowadays increasingly use data and Machine Learning (ML) models. This gives students, instructors, and administrators support and insights for choosing optimal policies. However, it is well acknowledged that ML models are subject to bias, which raises concerns about fairness, bias, and discrimination when these automated ML algorithms are used in education, along with their unintended and unforeseen negative consequences. Bias enters decision-making both through the datasets used to train ML models and through the model architecture. This paper presents a preliminary investigation of the fairness of transformer neural networks on two tabular datasets: Law School and Student-Mathematics. In contrast to classical ML models, transformer-based models transform these tabular datasets into a richer representation while solving the classification task. We use different fairness metrics for evaluation and examine the trade-off between fairness and accuracy of the transformer-based models on the tabular datasets. Empirically, our approach shows impressive results regarding the trade-off between fairness and performance on the Law School dataset.
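
The abstract's core idea, letting a transformer build a richer representation of tabular rows before classification, can be sketched minimally: embed each numeric feature as a token, run a transformer encoder, and classify from a CLS-style token. All hyperparameters and design details below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class TabularTransformer(nn.Module):
    """Minimal sketch of a transformer classifier for tabular data.
    Each numeric feature becomes a token via a per-feature linear embedding;
    a transformer encoder mixes the tokens; a CLS-style token feeds the head.
    Hyperparameters are illustrative, not taken from the paper."""
    def __init__(self, n_features, d_model=32, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # One learned linear embedding (weight + bias) per feature
        self.feature_weight = nn.Parameter(torch.randn(n_features, d_model) * 0.02)
        self.feature_bias = nn.Parameter(torch.zeros(n_features, d_model))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                       # x: (batch, n_features)
        tokens = x.unsqueeze(-1) * self.feature_weight + self.feature_bias
        cls = self.cls_token.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(h[:, 0])               # logits from the CLS position

# model = TabularTransformer(n_features=10)
# logits = model(torch.randn(8, 10))            # shape (8, 2)
```

Fairness-accuracy trade-offs can then be examined by pairing the model's test accuracy with group fairness measures such as those sketched earlier in this listing.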