
Collaborating Authors

 Rudin, Cynthia


This Looks Better than That: Better Interpretable Models with ProtoPNeXt

arXiv.org Artificial Intelligence

Prototypical-part models are a popular interpretable alternative to black-box deep learning models for computer vision. However, they are difficult to train, with high sensitivity to hyperparameter tuning, inhibiting their application to new datasets and our understanding of which methods truly improve their performance. To facilitate the careful study of prototypical-part networks (ProtoPNets), we create a new framework for integrating components of prototypical-part models -- ProtoPNeXt. Using ProtoPNeXt, we show that applying Bayesian hyperparameter tuning and an angular prototype similarity metric to the original ProtoPNet is sufficient to produce new state-of-the-art accuracy for prototypical-part models on CUB-200 across multiple backbones. We further deploy this framework to jointly optimize for accuracy and prototype interpretability as measured by metrics included in ProtoPNeXt. Using the same resources, this produces models with substantially superior semantics and changes in accuracy between +1.3% and -1.5%. The code and trained models will be made publicly available upon publication.
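The abstract does not spell out the angular similarity metric; assuming it is a cosine-style similarity between local feature patches and learned prototype vectors (a common choice for "angular" metrics), a minimal NumPy sketch might look like:

```python
import numpy as np

def angular_similarity(patches, prototypes, eps=1e-8):
    """Cosine similarity between feature patches and prototype vectors.

    patches: (n_patches, d) array of local feature vectors
    prototypes: (n_prototypes, d) array of learned prototype vectors
    Returns an (n_patches, n_prototypes) similarity matrix in [-1, 1].
    """
    p = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + eps)
    q = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return p @ q.T

def prototype_activations(patches, prototypes):
    """Each prototype's activation is its best-matching patch in the image."""
    return angular_similarity(patches, prototypes).max(axis=0)
```

Unlike an L2-based similarity, an angular metric is invariant to feature magnitude, which is one plausible reason such a choice could help.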


SiamQuality: A ConvNet-Based Foundation Model for Imperfect Physiological Signals

arXiv.org Artificial Intelligence

Foundation models, especially those using transformers as backbones, have gained significant popularity, particularly in language and language-vision tasks. However, large foundation models are typically trained on high-quality data, which poses a significant challenge given the prevalence of poor-quality real-world data. This challenge is more pronounced when developing foundation models for physiological data; such data are often noisy, incomplete, or inconsistent. The present work aims to provide a toolset for developing foundation models on physiological data. We leverage a large dataset of photoplethysmography (PPG) signals from hospitalized intensive care patients. For this data, we propose SiamQuality, a novel self-supervised learning task that uses convolutional neural networks (CNNs) as the backbone and enforces similar representations for good- and poor-quality signals from similar physiological states. We pre-trained SiamQuality on over 36 million 30-second PPG pairs and then fine-tuned and tested it on six downstream tasks using external datasets. The results demonstrate the superiority of the proposed approach on all the downstream tasks, which are extremely important for heart monitoring on wearable devices. Our method indicates that CNNs can be an effective backbone for foundation models that are robust to training data quality.
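The exact SiamQuality objective is not reproduced in this summary; one illustrative way to enforce similar representations for paired good- and poor-quality segments is a negative mean cosine similarity between their embeddings (a hypothetical simplification, not the paper's actual loss):

```python
import numpy as np

def pair_similarity_loss(z_good, z_poor, eps=1e-8):
    """Encourage embeddings of good- and poor-quality segments from the
    same physiological state to align.

    z_good, z_poor: (batch, d) embedding arrays from the CNN backbone.
    Returns the negative mean cosine similarity (lower = more aligned).
    """
    a = z_good / (np.linalg.norm(z_good, axis=1, keepdims=True) + eps)
    b = z_poor / (np.linalg.norm(z_poor, axis=1, keepdims=True) + eps)
    return -np.mean(np.sum(a * b, axis=1))
```

Minimizing this loss over millions of clean/noisy pairs pushes the backbone to ignore signal-quality artifacts while preserving physiological content.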


Data Poisoning Attacks on Off-Policy Policy Evaluation Methods

arXiv.org Artificial Intelligence

Off-policy Evaluation (OPE) methods are a crucial tool for evaluating policies in high-stakes domains such as healthcare, where exploration is often infeasible, unethical, or expensive. However, the extent to which such methods can be trusted under adversarial threats to data quality is largely unexplored. In this work, we make the first attempt at investigating the sensitivity of OPE methods to marginal adversarial perturbations to the data. We design a generic data poisoning attack framework leveraging influence functions from robust statistics to carefully construct perturbations that maximize error in the policy value estimates. We carry out extensive experimentation with multiple healthcare and control datasets. Our results demonstrate that many existing OPE methods are highly prone to generating value estimates with large errors when subject to data poisoning attacks, even for small adversarial perturbations. These findings question the reliability of policy values derived using OPE methods and motivate the need for developing OPE methods that are statistically robust to train-time data poisoning attacks.
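The paper's influence-function framework is more general than this, but a toy example illustrates the idea: for an ordinary importance-sampling OPE estimator, which is linear in the rewards, the error-maximizing reward perturbation under an L-infinity budget has a closed form (all names below are illustrative):

```python
import numpy as np

def is_estimate(rewards, ratios):
    """Ordinary importance-sampling OPE estimate of the target policy value."""
    return np.mean(ratios * rewards)

def poison_rewards(rewards, ratios, eps):
    """Worst-case reward perturbation within an L-infinity budget eps.

    The IS estimate is linear in the rewards, so the gradient of the
    estimate w.r.t. each reward is ratios / n, and the perturbation that
    maximally inflates the estimate is eps * sign(gradient).
    """
    grad = ratios / len(rewards)
    return rewards + eps * np.sign(grad)
```

With all-positive importance ratios, the poisoned estimate exceeds the clean one by exactly eps times the mean ratio, even though each reward moved by at most eps.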


Sparse and Faithful Explanations Without Sparse Models

arXiv.org Machine Learning

Even if a model is not globally sparse, it is possible for decisions made from that model to be accurately and faithfully described by a small number of features. For instance, an application for a large loan might be denied to someone because they have no credit history, which overwhelms any evidence of their creditworthiness. In this work, we introduce the Sparse Explanation Value (SEV), a new way of measuring sparsity in machine learning models. In the loan denial example above, the SEV is 1 because only one factor is needed to explain why the loan was denied. SEV is a measure of decision sparsity rather than overall model sparsity, and we are able to show that many machine learning models -- even if they are not sparse -- actually have low decision sparsity, as measured by SEV. SEV is defined using movements over a hypercube, allowing SEV to be defined consistently over various model classes, with movement restrictions reflecting real-world constraints. We propose algorithms that reduce SEV without sacrificing accuracy, providing sparse and completely faithful explanations, even without globally sparse models.
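As a hypothetical illustration of decision sparsity (a simplification, not the paper's exact hypercube definition), suppose SEV counts the minimum number of features that must be moved to reference values to flip the model's decision; a brute-force sketch:

```python
from itertools import combinations
import numpy as np

def sev_minus(predict, x, reference, max_k=None):
    """Brute-force decision sparsity for one query point.

    Finds the smallest number of features of x that must be set to
    the reference values to change the prediction.
    predict: callable mapping a 1-D feature vector to a class label.
    Returns the minimal k, or None if no subset flips the decision.
    """
    d = len(x)
    base = predict(x)
    for k in range(1, (max_k or d) + 1):
        for idx in combinations(range(d), k):
            z = x.copy()
            z[list(idx)] = reference[list(idx)]
            if predict(z) != base:
                return k
    return None
```

In the loan example, a model whose denial is driven by a single credit-history feature would yield SEV = 1 even if the model itself uses many features.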


What is different between these datasets?

arXiv.org Artificial Intelligence

The performance of machine learning models heavily depends on the quality of input data, yet real-world applications often encounter various data-related challenges. One such challenge arises when curating training data or deploying a model in the real world: two comparable datasets in the same domain may have different distributions. While numerous techniques exist for detecting distribution shifts, the literature lacks comprehensive approaches for explaining dataset differences in a human-understandable manner. To address this gap, we propose a suite of interpretable methods (toolbox) for comparing two datasets. We demonstrate the versatility of our approach across diverse data modalities, including tabular data, language, images, and signals in both low- and high-dimensional settings. Our methods not only outperform comparable and related approaches in terms of explanation quality and correctness, but also provide actionable, complementary insights to understand and mitigate dataset differences effectively.


Optimal Sparse Survival Trees

arXiv.org Artificial Intelligence

Interpretability is crucial for doctors, hospitals, pharmaceutical companies, and biotechnology corporations to analyze and make decisions for high-stakes problems that involve human health. Tree-based methods have been widely adopted for survival analysis due to their appealing interpretability and their ability to capture complex relationships. However, most existing methods for producing survival trees rely on heuristic (or greedy) algorithms, which risk producing sub-optimal models. We present a dynamic-programming-with-bounds approach that finds provably optimal sparse survival tree models, frequently in only a few seconds.


Learned Kernels for Interpretable and Efficient Medical Time Series Processing

arXiv.org Artificial Intelligence

Background: Signal processing methods are the foundation for clinical interpretation across a wide variety of medical applications. The advent of deep learning allowed for an explosion of new models that offered unprecedented performance but at a cost: deep learning models are often compute-intensive and lack interpretability. Methods: We propose a sparse, interpretable architecture for medical time series processing. The method learns a set of lightweight flexible kernels to construct a single-layer neural network, providing a new efficient, robust, and interpretable approach. We introduce novel parameter reduction techniques to further reduce the size of our network. We demonstrate the power of our architecture on the important task of photoplethysmography artifact detection, where our approach has performance similar to the state-of-the-art deep neural networks with several orders of magnitude fewer parameters, allowing for the integration of deep neural network level performance into extremely low-power wearable devices. Results: Our interpretable method achieves greater than 99% of the performance of the state-of-the-art methods on the artifact detection task, and even outperforms the state-of-the-art on a challenging out-of-distribution test set, while using dramatically fewer parameters (2% of the parameters of Segade, and about half of the parameters of Tiny-PPG). Conclusions: Learned kernels are competitive with deep neural networks for medical time series processing with dramatically fewer parameters. Our method is particularly suited for real-time applications and low-power devices, and it maintains interpretability.
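The architecture details beyond "a set of lightweight flexible kernels in a single-layer network" aren't given in this summary; a minimal sketch of such a detector (the shapes, pooling, and linear head are assumptions for illustration) could be:

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Valid-mode 1-D cross-correlation of a signal with one kernel."""
    k = len(kernel)
    windows = np.lib.stride_tricks.sliding_window_view(signal, k)
    return windows @ kernel

def single_layer_detector(signal, kernels, weights, bias):
    """One conv layer -> ReLU -> global max pool -> linear score.

    kernels: (n_kernels, k) learned kernels
    weights: (n_kernels,) linear head, bias: scalar
    Returns an artifact score; a threshold on it would flag an artifact.
    """
    feats = np.array([np.maximum(conv1d_valid(signal, kern), 0).max()
                      for kern in kernels])
    return feats @ weights + bias
```

Because the entire model is a handful of short kernels plus a linear head, each kernel's learned waveform can be plotted and inspected directly, which is where the interpretability claim comes from.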


Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data

arXiv.org Artificial Intelligence

Many modern causal questions ask how treatments affect complex outcomes that are measured using wearable devices and sensors. Current analysis approaches require summarizing these data into scalar statistics (e.g., the mean), but these summaries can be misleading. For example, disparate distributions can have the same means, variances, and other statistics. Researchers can overcome the loss of information by instead representing the data as distributions. We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision-making: Analyzing Distributional Data via Matching After Learning to Stretch (ADD MALTS). We (i) provide analytical guarantees of the correctness of our estimation strategy, (ii) demonstrate via simulation that ADD MALTS outperforms other distributional data analysis methods at estimating treatment effects, and (iii) illustrate ADD MALTS' ability to verify whether there is enough cohesion between treatment and control units within subpopulations to trustworthily estimate treatment effects. We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks.
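To see why the abstract warns that scalar summaries mislead, compare summary statistics to a distributional distance: a two-point distribution at ±1 and a standard normal share mean 0 and variance 1, yet differ substantially in 1-D Wasserstein distance. (ADD MALTS itself learns a stretched metric for matching; this sketch only motivates treating the data as distributions.)

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance via sorted samples (equal sizes)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Two samples with (nearly) identical mean and variance but different shapes:
bimodal = np.concatenate([np.full(500, -1.0), np.full(500, 1.0)])  # mean 0, var 1
rng = np.random.default_rng(0)
gauss = rng.normal(0.0, 1.0, size=1000)                            # mean ~0, var ~1
```

The means and variances agree to within sampling noise, yet the Wasserstein distance between the two samples is clearly nonzero: a scalar-summary analysis would treat these outcomes as identical.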


The Rashomon Importance Distribution: Getting RID of Unstable, Single Model-based Variable Importance

arXiv.org Machine Learning

Quantifying variable importance is essential for answering high-stakes questions in fields like genetics, public policy, and medicine. Current methods generally calculate variable importance for a given model trained on a given dataset. However, for a given dataset, there may be many models that explain the target outcome equally well; without accounting for all possible explanations, different researchers may arrive at many conflicting yet equally valid conclusions given the same data. Additionally, even when accounting for all possible explanations for a given dataset, these insights may not generalize because not all good explanations are stable across reasonable data perturbations. We propose a new variable importance framework that quantifies the importance of a variable across the set of all good models and is stable across the data distribution. Our framework is extremely flexible and can be integrated with most existing model classes and global variable importance metrics. We demonstrate through experiments that our framework recovers variable importance rankings for complex simulation setups where other methods fail. Further, we show that our framework accurately estimates the true importance of a variable for the underlying data distribution. We provide theoretical guarantees on the consistency and finite sample error rates for our estimator. Finally, we demonstrate its utility with a real-world case study exploring which genes are important for predicting HIV load in persons with HIV, highlighting an important gene that has not previously been studied in connection with HIV. Code is available at https://github.com/jdonnelly36/Rashomon_Importance_Distribution.


Reconsideration on evaluation of machine learning models in continuous monitoring using wearables

arXiv.org Artificial Intelligence

Wearable devices, especially those utilizing the photoplethysmography (PPG) signal, have demonstrated significant potential in providing real-time insights into an individual's health status. PPG, due to its non-invasive nature and ease of integration into wearable technology, has become a cornerstone in modern health monitoring systems [5]. Analyzing wearable device signals often involves ML models of different complexities [6, 7]. In the model development phase, typically, continuous signals are cut into discrete segments, and the model's performance is evaluated at the segment level using conventional metrics such as accuracy, sensitivity, specificity, and F1 score [8]. However, relying solely on these conventional metrics at the segment level does not provide a holistic assessment; it hurts consumers by making it impossible to select the optimal solution for their needs, and innovators by failing to guide their efforts toward true progress. The complex nature of continuous health monitoring using wearable devices introduces unique challenges beyond the capabilities of conventional evaluation approaches, as illustrated in Figure 1. Recognizing these challenges is imperative for imbuing continuous health monitoring applications with accurate and reliable ML models, to ensure a successful translation of these models into everyday use by millions of people and to fulfill the potential of this technology at scale. In the subsequent sections, we outline the challenges in evaluating ML models for continuous health monitoring using wearables, thoroughly review existing evaluation methods and metrics, and propose a standardized evaluation guideline.
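The conventional segment-level metrics the passage critiques (accuracy, sensitivity, specificity, F1 [8]) all derive from a binary confusion matrix; the critique is not that these formulas are wrong, but that aggregating them over isolated segments hides continuous-monitoring failure modes:

```python
def segment_metrics(tp, fp, tn, fn):
    """Conventional segment-level metrics from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)        # recall on the positive class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}
```

For example, a rare-event monitor can score high segment-level accuracy while missing every clinically relevant episode, which is exactly the gap a standardized, monitoring-aware evaluation guideline would need to close.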