 Lu, Charles


Data Measurements for Decentralized Data Markets

arXiv.org Artificial Intelligence

Decentralized data markets can provide more equitable forms of data acquisition for machine learning. However, to realize practical marketplaces, efficient techniques for seller selection need to be developed. We propose and benchmark federated data measurements to allow a data buyer to find sellers with relevant and diverse datasets. Diversity and relevance measures enable a buyer to make relative comparisons between sellers without requiring intermediate brokers or training task-dependent models.


Data Acquisition via Experimental Design for Decentralized Data Markets

arXiv.org Artificial Intelligence

Acquiring high-quality training data is essential for current machine learning models. Data markets provide a way to increase the supply of data, particularly in data-scarce domains such as healthcare, by incentivizing potential data sellers to join the market. A major challenge for a data buyer in such a market is selecting the most valuable data points from a data seller. Unlike prior work in data valuation, which assumes centralized data access, we propose a federated approach to the data selection problem that is inspired by linear experimental design. Our proposed data selection method achieves lower prediction error without requiring labeled validation data and can be optimized in a fast and federated procedure. The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.


Deploying clinical machine learning? Consider the following...

arXiv.org Artificial Intelligence

Despite the intense attention and considerable investment in clinical machine learning (CML) research, relatively few applications have been deployed at large scale in real-world clinical environments. While research is important in advancing the state of the art, translation is equally important in bringing these techniques and technologies into a position to ultimately impact healthcare. We believe that a lack of appreciation for several considerations is a major cause of this discrepancy between expectation and reality. To better characterize a holistic perspective among researchers and practitioners, we survey several practitioners with commercial experience in developing CML for clinical deployment. Using these insights, we identify several main categories of challenges in order to better design and develop clinical machine learning applications.


Federated Conformal Predictors for Distributed Uncertainty Quantification

arXiv.org Artificial Intelligence

Conformal prediction is emerging as a popular paradigm for providing rigorous uncertainty quantification in machine learning since it can be easily applied as a post-processing step to already trained models. In this paper, we extend conformal prediction to the federated learning (FL) setting. The main challenge we face is data heterogeneity across the clients, which violates the fundamental tenet of exchangeability required for conformal prediction. We propose a weaker notion of partial exchangeability, better suited to the FL setting, and use it to develop the Federated Conformal Prediction (FCP) framework. We show that FCP enjoys rigorous theoretical guarantees and excellent empirical performance on several computer vision and medical imaging datasets. Our results demonstrate a practical approach to incorporating meaningful uncertainty quantification in distributed and heterogeneous environments. We provide the code used in our experiments at https://github.com/clu5/federated-conformal.
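For readers unfamiliar with the baseline that this abstract extends, the following is a minimal sketch of standard split conformal prediction on a single machine. It is not the FCP method from the paper, and it does not handle the federated or partially exchangeable setting. The function names `conformal_quantile` and `prediction_sets`, the nonconformity score 1 - p(label), and the toy numbers are all illustrative assumptions.

```python
import numpy as np


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    cal_scores: 1-D array of nonconformity scores computed on held-out
    calibration examples (here assumed to be 1 - p(true label)).
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    n = cal_scores.shape[0]
    # Use the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        # Too few calibration points for this alpha: every label is included.
        return np.inf
    return np.sort(cal_scores)[rank - 1]


def prediction_sets(probs, qhat):
    """Boolean mask of shape (n_examples, n_classes).

    A label is included when its score 1 - p(label) is at most qhat.
    """
    probs = np.asarray(probs, dtype=float)
    return (1.0 - probs) <= qhat


# Toy usage with made-up calibration scores and one test example.
cal_scores = np.arange(1, 11) / 10          # 0.1, 0.2, ..., 1.0
qhat = conformal_quantile(cal_scores, alpha=0.25)
sets = prediction_sets([[0.7, 0.2, 0.1]], qhat=0.85)
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least 1 - alpha; the abstract's point is that heterogeneous clients break that assumption, which is what motivates a weaker notion of exchangeability.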


Addressing catastrophic forgetting for medical domain expansion

arXiv.org Artificial Intelligence

Model brittleness is a key concern when deploying deep learning models in real-world medical settings. A model that has high performance at one institution may suffer a significant decline in performance when tested at other institutions. While pooling datasets from multiple institutions and re-training may provide a straightforward solution, it is often infeasible and may compromise patient privacy. An alternative approach is to fine-tune the model on subsequent institutions after training on the original institution. Notably, this approach degrades model performance at the original institution, a phenomenon known as catastrophic forgetting. In this paper, we develop an approach to address catastrophic forgetting based on elastic weight consolidation combined with modulation of batch normalization statistics under two scenarios: first, for expanding the domain from one imaging system's data to another imaging system's, and second, for expanding the domain from a large multi-institutional dataset to another single institution dataset. We show that our approach outperforms several other state-of-the-art approaches and provide theoretical justification for the efficacy of batch normalization modulation. The results of this study are generally applicable to the deployment of any clinical deep learning model which requires domain expansion.
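As background for the abstract above, here is a minimal sketch of the standard elastic weight consolidation (EWC) penalty on flattened parameter vectors. It covers only the generic EWC term, not the paper's combination with batch normalization statistics modulation. The names `ewc_penalty`, `ewc_loss`, `fisher_diag`, and `lam`, and the toy numbers, are illustrative assumptions.

```python
import numpy as np


def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    theta:       current parameters while fine-tuning on the new domain
    theta_star:  parameters learned on the original domain
    fisher_diag: diagonal Fisher information estimated on the original domain
    lam:         penalty strength (hypothetical hyperparameter name)
    """
    theta = np.asarray(theta, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)


def ewc_loss(task_loss, theta, theta_star, fisher_diag, lam):
    """New-domain task loss plus the penalty for drifting from theta_star."""
    return task_loss + ewc_penalty(theta, theta_star, fisher_diag, lam)


# Toy usage: parameters with larger Fisher values are penalized more for moving.
penalty = ewc_penalty([1.0, 2.0], [0.0, 0.0], [1.0, 0.5], lam=2.0)
total = ewc_loss(0.25, [1.0, 2.0], [0.0, 0.0], [1.0, 0.5], lam=2.0)
```

The penalty anchors the weights that mattered most at the original institution, which is how EWC limits the performance drop there while the model adapts to the new one.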


Stacked Neural Networks for end-to-end ciliary motion analysis

arXiv.org Machine Learning

Cilia are hairlike structures protruding from nearly every cell in the body. Diseases known as ciliopathies, where cilia function is disrupted, can result in a wide spectrum of disorders. However, most techniques for assessing ciliary motion rely on manual identification and tracking of cilia; this process is laborious and error-prone, and does not scale well. Even where automated ciliary motion analysis tools exist, their applicability is limited. Here, we propose an end-to-end computational machine learning pipeline that automatically identifies regions of cilia from videos, extracts patches of cilia, and classifies patients as exhibiting normal or abnormal ciliary motion. In particular, we demonstrate how convolutional LSTMs are able to encode complex features while remaining sensitive enough to differentiate between a variety of motion patterns. Our framework achieves 90% with only a few hundred training epochs. We find that the combination of segmentation and classification networks in a single pipeline yields performance comparable to existing computational pipelines, while providing the additional benefit of an end-to-end, fully automated analysis toolbox for ciliary motion.