Evaluating multiple models using labeled and unlabeled data
Shanmugam, Divya, Sadhuka, Shuvom, Raghavan, Manish, Guttag, John, Berger, Bonnie, Pierson, Emma
It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models.

Rigorous evaluation is essential to the safe deployment of machine learning classifiers. The standard approach is to measure classifier performance using a large labeled dataset. In practice, however, labeled data is often scarce (Culotta & McCallum, 2005; Dutta & Das, 2023). Exacerbating the challenge of evaluation, the number of off-the-shelf classifiers has increased dramatically with the widespread use of model hubs. The modern machine learning practitioner thus has a myriad of trained models, but little labeled data with which to evaluate them. In many domains, unlabeled data is much more abundant than labeled data (Bepler et al., 2019; Sagawa et al., 2021; Movva et al., 2024).
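To make the key idea concrete, the sketch below fits a semi-supervised mixture over a single binary classifier's scores with EM and then reads off an accuracy estimate. It is a minimal illustration of the approach described above, not the authors' implementation: the binary task, the Gaussian class-conditionals on logit-transformed scores, and the function names (`fit_ssme`, `estimate_accuracy`) are all our assumptions.

```python
# Minimal SSME-style sketch (our illustration): fit a semi-supervised
# mixture over classifier scores with EM -- labeled points have fixed
# responsibilities, unlabeled points get inferred ones -- then estimate a
# metric (here, accuracy) from the fitted joint of (label, score).
import numpy as np
from scipy.stats import norm
from scipy.special import logit

def fit_ssme(scores_l, y_l, scores_u, n_iters=50):
    z_l = logit(np.clip(scores_l, 1e-6, 1 - 1e-6))  # labeled scores, logit scale
    z_u = logit(np.clip(scores_u, 1e-6, 1 - 1e-6))  # unlabeled scores
    pi = y_l.mean()                                  # initial class prior
    mu = np.array([z_l[y_l == 0].mean(), z_l[y_l == 1].mean()])
    sd = np.array([z_l[y_l == 0].std() + 1e-3, z_l[y_l == 1].std() + 1e-3])
    for _ in range(n_iters):
        # E-step: responsibilities for unlabeled scores only.
        p1 = pi * norm.pdf(z_u, mu[1], sd[1])
        p0 = (1 - pi) * norm.pdf(z_u, mu[0], sd[0])
        r_u = p1 / (p1 + p0)
        # M-step: pool hard labels with soft responsibilities.
        r = np.concatenate([y_l.astype(float), r_u])
        z = np.concatenate([z_l, z_u])
        pi = r.mean()
        for k, w in enumerate([1 - r, r]):
            mu[k] = (w * z).sum() / w.sum()
            sd[k] = np.sqrt((w * (z - mu[k]) ** 2).sum() / w.sum()) + 1e-3
    return pi, mu, sd

def estimate_accuracy(pi, mu, sd, n=100_000, seed=0):
    # Sample (label, score) pairs from the fitted joint; accuracy is how
    # often thresholding the score at 0.5 (logit 0) matches the label.
    rng = np.random.default_rng(seed)
    y = rng.random(n) < pi
    z = rng.normal(mu[y.astype(int)], sd[y.astype(int)])
    return float(((z > 0) == y).mean())
```

In this toy version, accuracy reduces to agreement between the 0.5-threshold decision and labels drawn from the fitted joint; other metrics (e.g., expected calibration error) follow the same recipe of computing a functional of the estimated joint distribution.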
Learning Disease Progression Models That Capture Health Disparities
Chiang, Erica, Shanmugam, Divya, Beecy, Ashley N., Sayer, Gabriel, Uriel, Nir, Estrin, Deborah, Garg, Nikhil, Pierson, Emma
Disease progression models are widely used to inform the diagnosis and treatment of many progressive diseases. However, a significant limitation of existing models is that they do not account for health disparities that can bias the observed data. To address this, we develop an interpretable Bayesian disease progression model that captures three key health disparities: certain patient populations may (1) start receiving care only when their disease is more severe, (2) experience faster disease progression even while receiving care, or (3) receive follow-up care less frequently conditional on disease severity. We show theoretically and empirically that failing to account for disparities produces biased estimates of severity (underestimating severity for disadvantaged groups, for example). On a dataset of heart failure patients, we show that our model can identify groups that face each type of health disparity, and that accounting for these disparities meaningfully shifts which patients are considered high-risk.
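As a caricature of how the three disparities could enter such a generative model, the sketch below simulates one patient's observed visits. The parameter names (`tau_g`, `rho_g`, `lam_g`) and all distributional choices are hypothetical illustrations, not the authors' model; in particular, the visit rate here drops the severity-conditioning described in disparity (3).

```python
# Hypothetical generative sketch (not the paper's model) of the three
# disparity mechanisms: tau_g = severity threshold to enter care,
# rho_g = progression rate while in care, lam_g = follow-up visit rate.
import numpy as np

rng = np.random.default_rng(0)

def simulate_patient(tau_g, rho_g, lam_g, horizon=10.0):
    """Simulate one patient's observed (time, severity) visits for group g."""
    severity, t, in_care, visits = rng.exponential(1.0), 0.0, False, []
    while t < horizon:
        t += rng.exponential(1.0 / lam_g)          # (3) lower lam_g -> sparser follow-up
        severity += rho_g * rng.exponential(0.5)   # (2) higher rho_g -> faster progression
        if not in_care and severity >= tau_g:      # (1) higher tau_g -> sicker at first visit
            in_care = True
        if in_care:                                # severity observed only once in care
            visits.append((t, severity + rng.normal(0.0, 0.1)))
    return visits
```

A group with larger `tau_g` enters the data at higher severity, so naively fitting trajectories to the observed visits underestimates how long that group has been sick, which is the bias the abstract describes.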
Generative AI in Medicine
Shanmugam, Divya, Agrawal, Monica, Movva, Rajiv, Chen, Irene Y., Ghassemi, Marzyeh, Jacobs, Maia, Pierson, Emma
Excitement about the promise of generative AI in medicine has inspired an explosion of new applications. Generative models have the potential to change how care is delivered (1-5), the roles and responsibilities of care providers (6, 7), and the communication pathways between patients and providers (8, 9). Further upstream, generative models have shown promise in accelerating scientific discovery in medicine (through both clinical trials (10, 11) and observational research (12, 13)) and in facilitating medical education (8, 14). These developments are a direct result of technical advances in generative AI, which have drastically increased the ability to generate realistic language and images and which raise important questions about how to integrate generative models into medicine. Generative AI is the latest in a series of technical advances that have driven major shifts in medicine. Past significant advances include the adoption of electronic health records (EHRs); the integration of robotics into telesurgeries (15); and the incorporation of predictive models and continuous monitoring as foundational infrastructure for new diagnostic tools (16, 17).
Machine Learning for Health symposium 2023 -- Findings track
Hegselmann, Stefan, Parziale, Antonio, Shanmugam, Divya, Tang, Shengpu, Asiedu, Mercy Nyamewaa, Chang, Serina, Hartvigsen, Thomas, Singh, Harvineet
A collection of the accepted Findings papers presented at the 3rd Machine Learning for Health symposium (ML4H 2023), held on December 10, 2023, in New Orleans, Louisiana, USA. ML4H 2023 invited high-quality submissions on relevant problems in a variety of health-related disciplines, including healthcare, biomedicine, and public health. Two submission tracks were offered: the archival Proceedings track and the non-archival Findings track. The Proceedings track targeted mature work with strong technical sophistication and high impact on health. The Findings track sought new ideas that could spark insightful discussion, serve as valuable resources for the community, or enable new collaborations. Submissions to the Proceedings track, if not accepted, were automatically considered for the Findings track. All manuscripts submitted to the ML4H symposium underwent a double-blind peer-review process.
Quantifying disparities in intimate partner violence: a machine learning method to correct for underreporting
Shanmugam, Divya, Hou, Kaihua, Pierson, Emma
Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy that prioritizes groups disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without restrictive assumptions commonly made in positive unlabeled learning and even when it is impossible to recover the absolute prevalence. We conduct experiments on synthetic and real health data that demonstrate our method's ability to recover the relative prevalence more accurately than baselines, as well as its robustness to plausible violations of the covariate shift assumption. We conclude by illustrating the applicability of our method to case studies of intimate partner violence and hate speech.
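The sketch below illustrates this style of estimator under one additional simplification we introduce for exposition: the condition, when present, is reported with a constant unknown frequency c. Because a single classifier of reporting is fit on pooled data, c cancels in the ratio of group means, leaving the relative prevalence. The function and variable names are our assumptions, not the paper's API.

```python
# Minimal PU-style relative prevalence sketch (our simplification).
# Under covariate shift (p(condition | symptoms) shared across groups) and
# a constant reporting frequency c = p(reported | condition), we have
# p(reported | x) = c * p(condition | x), so c cancels in the ratio below.
from sklearn.linear_model import LogisticRegression

def relative_prevalence(X, s, group):
    """X: symptom features; s: 1 if the condition was reported; group: 0/1 array."""
    g = LogisticRegression(max_iter=1000).fit(X, s)
    p = g.predict_proba(X)[:, 1]          # estimate of p(reported | x)
    # Ratio of mean scores across groups: the unknown c cancels, leaving
    # an estimate of p_1(condition) / p_0(condition).
    return p[group == 1].mean() / p[group == 0].mean()
```

Note that neither group's absolute prevalence is identified here; only their ratio survives the cancellation, which mirrors the identifiability result described in the abstract.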
Coarse race data conceals disparities in clinical risk score performance
Movva, Rajiv, Shanmugam, Divya, Hou, Kaihua, Pathak, Priya, Guttag, John, Garg, Nikhil, Pierson, Emma
Healthcare data in the United States often records only a patient's coarse race group: for example, both Indian and Chinese patients are typically coded as "Asian." It is unknown, however, whether this coarse coding conceals meaningful disparities in the performance of clinical risk scores across granular race groups. Here we show that it does. Using data from 418K emergency department visits, we assess clinical risk score performance disparities across 26 granular groups for three outcomes, five risk scores, and four performance metrics. Across outcomes and metrics, we show that the risk scores exhibit significant granular performance disparities within coarse race groups. In fact, variation in performance within coarse groups often *exceeds* the variation between coarse groups. We explore why these disparities arise, finding that outcome rates, feature distributions, and the relationships between features and outcomes all vary significantly across granular groups. Our results suggest that healthcare providers, hospital systems, and machine learning researchers should strive to collect, release, and use granular race data in place of coarse race data, and that existing analyses may significantly underestimate racial disparities in performance.
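The within- versus between-group comparison above can be made concrete in a few lines of pandas. The sketch below assumes hypothetical column names and uses AUC as the example metric; it is an illustration of the analysis pattern, not the paper's code.

```python
# Sketch (assumed column names): per-granular-group AUC, then compare the
# spread of granular AUCs inside each coarse race group to the spread
# across coarse-group means.
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_spreads(df: pd.DataFrame):
    """df columns (hypothetical): risk_score, outcome, coarse_race, granular_race."""
    per_granular = (
        df.groupby(["coarse_race", "granular_race"])
          .apply(lambda g: roc_auc_score(g["outcome"], g["risk_score"]))
          .rename("auc")
          .reset_index()
    )
    within = per_granular.groupby("coarse_race")["auc"].std()           # spread inside each coarse group
    between = per_granular.groupby("coarse_race")["auc"].mean().std()   # spread across coarse-group means
    return within, between
```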
Multiple Instance Learning for ECG Risk Stratification
Shanmugam, Divya, Blalock, Davis, Gong, Jen G., Guttag, John
In this paper, we apply a multiple instance learning (MIL) paradigm to signal-based risk stratification for cardiovascular outcomes. In contrast to methods that require handcrafted features or domain knowledge, our method learns a representation with state-of-the-art predictive power directly from the raw ECG signal. The MIL framework is particularly well suited to learning from biometric signals, where patient-level labels are available but signal segments are rarely annotated. We make two contributions in this paper: 1) reframing risk stratification for cardiovascular death (CVD) as a multiple instance learning problem, and 2) using this framework to design a new risk score, for which patients in the highest quartile are 15.9 times more likely to die of CVD within 90 days of hospital admission for an acute coronary syndrome.
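A minimal PyTorch sketch of the MIL setup: each ECG segment is an instance, a patient is a bag, and pooling over instance embeddings yields a single patient-level risk even though segment-level labels are unavailable. The architecture (an MLP encoder with max pooling) is our illustration, not the paper's model.

```python
# Sketch of multiple instance learning for patient-level risk from
# segment-level signals. Each segment is embedded independently; pooling
# over segments lets a single patient label supervise all segments.
import torch
import torch.nn as nn

class MILRiskModel(nn.Module):
    def __init__(self, segment_len=256, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(          # per-segment (instance) embedding
            nn.Linear(segment_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)       # patient-level (bag) risk score

    def forward(self, bag):
        # bag: (n_segments, segment_len) -- all segments from one patient,
        # with only a patient-level label available for training.
        h = self.encoder(bag)                  # (n_segments, hidden)
        pooled = h.max(dim=0).values           # max-pool over instances
        return torch.sigmoid(self.head(pooled))

# Usage: risk = MILRiskModel()(torch.randn(120, 256))  # one patient's segments
```

Max pooling is one common MIL aggregator; attention-based pooling is a frequent alternative when the contribution of individual segments should be interpretable.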