Do covariates explain why these groups differ? The choice of reference group can reverse conclusions in the Oaxaca-Blinder decomposition
Quintero, Manuel, Shreekumar, Advik, Stephenson, William T., Broderick, Tamara
Scientists often want to explain why an outcome is different in two groups. For instance, differences in patient mortality rates across two hospitals could be due to differences in the patients themselves (covariates) or differences in medical care (outcomes given covariates). The Oaxaca-Blinder decomposition (OBD) is a standard tool to tease apart these factors. It is well known that the OBD requires choosing one of the groups as a reference, and the numerical answer can vary with the reference. To the best of our knowledge, there has not been a systematic investigation into whether the choice of OBD reference can yield different substantive conclusions and how common this issue is. In the present paper, we give existence proofs in real and simulated data that different OBD reference choices can yield substantively different conclusions and that these differences are not entirely driven by model misspecification or small data. We prove that substantively different conclusions occur in up to half of the parameter space, but find these discrepancies rare in the real-data analyses we study. We explain this empirical rarity by examining how realistic data-generating processes can be biased towards parameters that do not change conclusions under the OBD.
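For orientation, the standard two-fold form of the decomposition makes the reference dependence concrete. The sketch below uses generic notation (groups A and B, covariate means, group-specific least-squares coefficients) rather than the paper's own:

```latex
% Two-fold Oaxaca-Blinder decomposition of the mean outcome gap between
% groups A and B, with group-specific linear fits \bar{Y}_g = \bar{X}_g'\hat{\beta}_g.
\begin{align*}
\bar{Y}_A - \bar{Y}_B
  &= \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_A}_{\text{explained (A as reference)}}
   + \underbrace{\bar{X}_B'\,(\hat{\beta}_A - \hat{\beta}_B)}_{\text{unexplained}} \\[4pt]
  &= \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_B}_{\text{explained (B as reference)}}
   + \underbrace{\bar{X}_A'\,(\hat{\beta}_A - \hat{\beta}_B)}_{\text{unexplained}} .
\end{align*}
```

Both lines sum to the same total gap, but they split it differently into explained and unexplained components, which is how switching the reference group can change which factor appears to drive the difference.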
- North America > Mexico > Oaxaca (0.26)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > Michigan (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence (0.94)
- Information Technology > Data Science (0.88)
Penalized Fair Regression for Multiple Groups in Chronic Kidney Disease
Nakamoto, Carter H., Chen, Lucia Lushi, Foryciarz, Agata, Rose, Sherri
Fair regression methods have the potential to mitigate societal bias concerns in health care, but there has been little work on penalized fair regression when multiple groups experience such bias. We propose a general regression framework that addresses this gap with unfairness penalties for multiple groups. Our approach is demonstrated for binary outcomes with true positive rate disparity penalties. It can be efficiently implemented through reduction to a cost-sensitive classification problem. We additionally introduce novel score functions for automatically selecting penalty weights. Our penalized fair regression methods are empirically studied in simulations, where they achieve a fairness-accuracy frontier beyond that of existing comparison methods. Finally, we apply these methods to a national multi-site primary care study of chronic kidney disease to develop a fair classifier for end-stage renal disease. There we find substantial improvements in fairness for multiple race and ethnicity groups who experience societal bias in the health care system without any appreciable loss in overall fit.
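One way to read the multi-group unfairness penalties is as a penalized empirical-risk objective; the form below is a generic sketch under that reading, with the group set \(\mathcal{G}\) and per-group weights \(\lambda_g\) as illustrative notation rather than the paper's own:

```latex
% Prediction loss plus one true-positive-rate disparity penalty per
% group g, each weighted by its own lambda_g.
\[
\min_{f}\;\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big)
\;+\; \sum_{g \in \mathcal{G}} \lambda_g\,
\Big|\, \widehat{\mathrm{TPR}}_g(f) - \widehat{\mathrm{TPR}}(f) \,\Big|
\]
```

Objectives of this shape are what typically admit a reduction to cost-sensitive classification, since the disparity terms can be folded into instance-level weights.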
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Alaska (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area > Nephrology (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.68)
Empowering Clinical Trial Design through AI: A Randomized Evaluation of PowerGPT
Lu, Yiwen, Li, Lu, Zhang, Dazheng, Jian, Xinyao, Wang, Tingyin, Chen, Siqi, Lei, Yuqing, Tong, Jiayi, Xi, Zhaohan, Chu, Haitao, Luo, Chongliang, Ogdie, Alexis, Athey, Brian, Turan, Alparslan, Abramoff, Michael, Cappelleri, Joseph C, Xu, Hua, Lu, Yun, Berlin, Jesse, Sessler, Daniel I., Asch, David A., Jiang, Xiaoqian, Chen, Yong
Sample size calculations for power analysis are critical for clinical research and trial design, yet their complexity and reliance on statistical expertise create barriers for many researchers. We introduce PowerGPT, an AI-powered system integrating large language models (LLMs) with statistical engines to automate test selection and sample size estimation in trial design. In a randomized trial to evaluate its effectiveness, PowerGPT significantly improved task completion rates (99.3% vs. 88.9% for test selection, 99.3% vs. 77.8% for sample size calculation) and accuracy (94.1% vs. 55.4% in sample size estimation, p < 0.001), while reducing average completion time (4.0 vs. 9.3 minutes, p < 0.001). These gains were consistent across various statistical tests, benefited statisticians and non-statisticians alike, and helped bridge expertise gaps. Already deployed across multiple institutions, PowerGPT represents a scalable AI-driven approach that enhances accessibility, efficiency, and accuracy in statistical power analysis for clinical research.
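For context, the kind of computation being automated can be sketched with an off-the-shelf power analysis; the snippet below is a generic illustration using statsmodels, not PowerGPT's interface, and the effect size, alpha, and power are made-up inputs:

```python
# Sketch: sample-size calculation for a two-arm trial with a continuous
# endpoint and a two-sample t-test, the kind of task PowerGPT automates.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(
    effect_size=0.5,   # assumed standardized mean difference (Cohen's d)
    alpha=0.05,        # two-sided type I error rate
    power=0.80,        # target statistical power
    ratio=1.0,         # equal allocation between arms
)
print(f"Required sample size per arm: {n_per_arm:.0f}")
```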
- North America > United States > Pennsylvania (0.30)
- North America > United States > Texas (0.28)
- North America > United States > Iowa (0.28)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation
Huijzer, Willem, Chen, Jieying
The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. This study examines biases in LLMs related to background, gender, and age, with a focus on their impact on decision-making and summarisation tasks. Additionally, the research examines the cross-lingual propagation of these biases and evaluates the effectiveness of prompt-instructed mitigation strategies. Using an adapted version of the dataset by Tamkin et al. (2023) translated into Dutch, we created 151,200 unique prompts for the decision task and 176,400 for the summarisation task. Various demographic variables, instructions, salience levels, and languages were tested on GPT-3.5 and GPT-4o. Our analysis revealed that both models were significantly biased during decision-making, favouring the female gender, younger ages, and certain backgrounds, such as African-American. In contrast, the summarisation task showed minimal evidence of bias, though significant age-related differences emerged for GPT-3.5 in English. Cross-lingual analysis showed that bias patterns were broadly similar between English and Dutch, though notable differences were observed across specific demographic categories. The newly proposed mitigation instructions, while unable to eliminate biases completely, demonstrated potential in reducing them. The most effective instruction achieved a 27% mean reduction in the gap between the most and least favourable demographics. Notably, in contrast to GPT-3.5, GPT-4o displayed reduced biases for all prompts in English, indicating particular potential for prompt-based mitigation in newer models. This research underscores the importance of cautious adoption of LLMs and context-specific bias testing, highlighting the need for continued development of effective mitigation strategies to ensure responsible deployment of AI.
- Europe (1.00)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Government (1.00)
- Banking & Finance (0.93)
Trust in Disinformation Narratives: a Trust in the News Experiment
Song, Hanbyul, Silva, Miguel F. Santos, Suau, Jaume, Espinosa-Anke, Luis
Understanding why people trust or distrust one another, institutions, or information is a complex task that has led scholars from various fields of study to employ diverse epistemological and methodological approaches. Despite the challenges, it is generally agreed that the antecedents of trust (and distrust) encompass a multitude of emotional and cognitive factors, including a general disposition to trust and an assessment of trustworthiness factors. In an era marked by increasing political polarization, cultural backlash, widespread disinformation and fake news, and the use of AI software to produce news content, the need to study trust in the news has gained significant traction. This study presents the findings of a trust in the news experiment designed in collaboration with Spanish and UK journalists, fact-checkers, and the CardiffNLP Natural Language Processing research group. The purpose of this experiment, conducted in June 2023, was to examine the extent to which people trust a set of fake news articles based on previously identified disinformation narratives related to gender, climate change, and COVID-19. The online experiment participants (801 in Spain and 800 in the UK) were asked to read three fake news items and rate their level of trust on a scale from 1 (not true) to 8 (true). The pieces used a combination of factors, including stance (favourable, neutral, or against the narrative), presence of toxic expressions, clickbait titles, and sources of information to test which elements influenced people's responses the most. Half of the pieces were produced by humans and the other half by ChatGPT. The results show that the topic of news articles, stance, people's age, gender, and political ideologies significantly affected their levels of trust in the news, while authorship (human or ChatGPT) did not have a significant impact.
- Europe > Spain (0.25)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Media > News (1.00)
- Health & Medicine > Therapeutic Area > Immunology (0.52)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.35)
Finding Words Associated with DIF: Predicting Differential Item Functioning using LLMs and Explainable AI
We fine-tuned and compared several encoder-based Transformer large language models (LLM) to predict differential item functioning (DIF) from the item text. We then applied explainable artificial intelligence (XAI) methods to these models to identify specific words associated with DIF. The data included 42,180 items designed for English language arts and mathematics summative state assessments among students in grades 3 to 11. Prediction $R^2$ ranged from .04 to .32 among eight focal and reference group pairs. Our findings suggest that many words associated with DIF reflect minor sub-domains included in the test blueprint by design, rather than construct-irrelevant item content that should be removed from assessments. This may explain why qualitative reviews of DIF items often yield confusing or inconclusive results. Our approach can be used to screen words associated with DIF during the item-writing process for immediate revision, or help review traditional DIF analysis results by highlighting key words in the text. Extensions of this research can enhance the fairness of assessment programs, especially those that lack resources to build high-quality items, and among smaller subpopulations where we do not have sufficient sample sizes for traditional DIF analyses.
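A minimal sketch of the modeling setup described (an encoder fine-tuned to predict a continuous DIF statistic from item text) is given below; the model choice, toy data, and training settings are illustrative assumptions, not the authors' configuration:

```python
# Sketch: fine-tune an encoder model to regress a DIF statistic from item text.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

items = Dataset.from_dict({
    "text": ["Sample reading passage item ...", "Sample math word problem ..."],
    "label": [0.12, -0.35],          # e.g., a DIF effect-size estimate per item
})

tok = AutoTokenizer.from_pretrained("roberta-base")
items = items.map(lambda b: tok(b["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dif-model", num_train_epochs=3),
    train_dataset=items,
    data_collator=DataCollatorWithPadding(tok),
)
trainer.train()
# Token-level attribution (e.g., SHAP or integrated gradients) over the
# fine-tuned model can then surface words associated with predicted DIF.
```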
Navigating Fairness in Radiology AI: Concepts, Consequences, and Crucial Considerations
Venugopal, Vasantha Kumar, Gupta, Abhishek, Takhar, Rohit, Yee, Charlene Liew Jin, Jones, Catherine, Szarf, Gilberto
Artificial Intelligence (AI) has significantly revolutionized radiology, promising improved patient outcomes and streamlined processes. However, it's critical to ensure the fairness of AI models to prevent stealthy bias and disparities from leading to unequal outcomes. This review discusses the concept of fairness in AI, focusing on bias auditing using the Aequitas toolkit, and its real-world implications in radiology, particularly in disease screening scenarios. Aequitas, an open-source bias audit toolkit, scrutinizes AI models' decisions, identifying hidden biases that may result in disparities across different demographic groups and imaging equipment brands. This toolkit operates on statistical theories, analyzing a large dataset to reveal a model's fairness. It excels in its versatility to handle various variables simultaneously, especially in a field as diverse as radiology. The review explicates essential fairness metrics: Equal and Proportional Parity, False Positive Rate Parity, False Discovery Rate Parity, False Negative Rate Parity, and False Omission Rate Parity. Each metric serves unique purposes and offers different insights. We present hypothetical scenarios to demonstrate their relevance in disease screening settings, and how disparities can lead to significant real-world impacts.
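The parity metrics listed are standard group-level error-rate comparisons against a reference group; for concreteness, one common way to state them (the notation and tolerance band are mine, not the review's):

```latex
% Confusion-matrix rates for group g, compared against a reference group r.
% A disparity ratio far from 1 (e.g., outside a band such as [0.8, 1.25])
% flags a potential fairness violation.
\begin{align*}
\mathrm{FPR}_g &= \frac{FP_g}{FP_g + TN_g}, &
\mathrm{FDR}_g &= \frac{FP_g}{FP_g + TP_g}, \\
\mathrm{FNR}_g &= \frac{FN_g}{FN_g + TP_g}, &
\mathrm{FOR}_g &= \frac{FN_g}{FN_g + TN_g}, \\
\text{disparity}_g &= \mathrm{FPR}_g / \mathrm{FPR}_r
  \quad \text{(and analogously for FDR, FNR, and FOR).}
\end{align*}
```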
- Oceania > Australia (0.14)
- Europe > United Kingdom (0.14)
- Asia > China (0.05)
- (7 more...)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Auditing ICU Readmission Rates in a Clinical Database: An Analysis of Risk Factors and Clinical Outcomes
This study presents a machine learning (ML) pipeline for clinical data classification in the context of a 30-day readmission problem, along with a fairness audit on subgroups based on sensitive attributes. A range of ML models are used for classification, and the fairness audit is conducted on the model predictions. The fairness audit uncovers disparities in equal opportunity, predictive parity, false positive rate parity, and false negative rate parity criteria on the MIMIC-III dataset based on attributes such as gender, ethnicity, language, and insurance group. The results identify disparities in the model's performance across different groups and highlight the need for better fairness and bias mitigation strategies. The study suggests the need for collaborative efforts among researchers, policymakers, and practitioners to address bias and fairness in artificial intelligence (AI) systems.
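A minimal sketch of such a subgroup audit, computed directly from model predictions with scikit-learn, is shown below; the column names, threshold, and chosen metrics are assumptions for illustration, not the study's pipeline:

```python
# Sketch: per-group confusion-matrix rates for auditing a binary
# 30-day readmission classifier across subgroups.
import pandas as pd
from sklearn.metrics import confusion_matrix

def group_rates(df, group_col, y_col="readmit_30d", score_col="score", thr=0.5):
    rows = []
    for g, sub in df.groupby(group_col):
        y_pred = (sub[score_col] >= thr).astype(int)
        tn, fp, fn, tp = confusion_matrix(sub[y_col], y_pred, labels=[0, 1]).ravel()
        rows.append({
            group_col: g,
            "TPR": tp / (tp + fn) if (tp + fn) else float("nan"),  # equal opportunity
            "PPV": tp / (tp + fp) if (tp + fp) else float("nan"),  # predictive parity
            "FPR": fp / (fp + tn) if (fp + tn) else float("nan"),
            "FNR": fn / (fn + tp) if (fn + tp) else float("nan"),
        })
    return pd.DataFrame(rows)

# Example usage: group_rates(predictions_df, group_col="ethnicity")
```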
- North America > United States (0.29)
- North America > Canada > Ontario > Toronto (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Health Care Providers & Services (1.00)
- Government (1.00)
- (3 more...)
Fairness implications of encoding protected categorical attributes
Mougan, Carlos, Alvarez, Jose M., Patro, Gourab K, Ruggieri, Salvatore, Staab, Steffen
Protected attributes are often presented as categorical features that need to be encoded before feeding them into a machine learning algorithm. Encoding these attributes is paramount, as it determines the way the algorithm will learn from the data. Categorical feature encoding has a direct impact on model performance and fairness. In this work, we compare the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding. We distinguish between two types of induced bias that can arise while using these encodings and can lead to unfair models. The first type, irreducible bias, is due to direct group category discrimination, and the second type, reducible bias, is due to large variance in less statistically represented groups. We take a deeper look into how regularization methods for target encoding can mitigate the induced bias when encoding categorical features. Furthermore, we tackle the problem of intersectional fairness that arises when combining two protected categorical features, leading to higher cardinality. This practice is a powerful feature engineering technique used for boosting model performance. We study its implications for fairness, as it can increase both types of induced bias.
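A minimal sketch contrasting the two encoders on a protected categorical column is given below; the additive-smoothing form is one common regularization for target encoding and is an assumption here, not necessarily the paper's exact variant:

```python
# Sketch: one-hot vs. smoothed target encoding of a protected attribute.
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "c"],  # protected categorical feature
    "y":     [1,   0,   1,   1,   0,   1],    # binary outcome
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["group"], prefix="group")

# Target encoding with additive smoothing toward the global mean: rare
# categories are pulled toward the prior, which is the regularization
# lever for reducing the variance-driven (reducible) bias.
prior, m = df["y"].mean(), 2.0                 # m: smoothing strength (assumed)
stats = df.groupby("group")["y"].agg(["mean", "count"])
encoding = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
df["group_te"] = df["group"].map(encoding)

print(one_hot.join(df[["group", "group_te", "y"]]))
```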
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > United Kingdom > England > Hampshire > Southampton (0.04)
- (18 more...)
- Law (1.00)
- Government > Regional Government (0.67)
- Education > Curriculum > Subject-Specific Education (0.48)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Years Lived Alone and/or Serial Break-Ups Strongly Linked to Inflammation in Men - Neuroscience News
Summary: Men who spend several years living alone or experience serial relationship breakups are at increased risk of inflammation, a new study reports. Living alone for several years and/or experiencing serial relationship break-ups are strongly linked to raised levels of inflammatory markers in the blood, but only in men, finds a large population study published online in the Journal of Epidemiology & Community Health. Although the inflammation was classified as low grade, it was persistent, and most likely indicates a heightened risk of age-related ill health and death, suggest the researchers. Divorce and committed relationship break-ups, which are often followed by a potentially lengthy period of living alone, have been associated with a heightened risk of poor physical and mental health, lowered immunity, and death. But most previously published studies have focused on the impact of one partnership dissolution, and then usually only on marital break-ups.