Tournament Leave-pair-out Cross-validation for Receiver Operating Characteristic (ROC) Analysis

arXiv.org Machine Learning

Receiver operating characteristic (ROC) analysis is widely used for evaluating diagnostic systems. Recent studies have shown that estimating an area under ROC curve (AUC) with standard cross-validation methods suffers from a large bias. The leave-pair-out (LPO) cross-validation has been shown to correct this bias. However, while LPO produces an almost unbiased estimate of AUC, it does not provide a ranking of the data needed for plotting and analyzing the ROC curve. In this study, we propose a new method called tournament leave-pair-out (TLPO) cross-validation. This method extends LPO by creating a tournament from pair comparisons to produce a ranking for the data. TLPO preserves the advantage of LPO for estimating AUC, while it also allows performing ROC analysis. We have shown using both synthetic and real world data that TLPO is as reliable as LPO for AUC estimation and confirmed the bias in leave-one-out cross-validation on low-dimensional data.


Bayesian Multi Plate High Throughput Screening of Compounds

arXiv.org Machine Learning

High throughput screening of compounds (chemicals) is an essential part of drug discovery [7], involving thousands to millions of compounds, with the purpose of identifying candidate hits. Most statistical tools, including the industry standard B-score method, work on individual compound plates and do not exploit cross-plate correlation or statistical strength among plates. We present a new statistical framework for high throughput screening of compounds based on Bayesian nonparametric modeling. The proposed approach is able to identify candidate hits from multiple plates simultaneously, sharing statistical strength among plates and providing more robust estimates of compound activity. It can flexibly accommodate arbitrary distributions of compound activities and is applicable to any plate geometry. The algorithm provides a principled statistical approach for hit identification and false discovery rate control. Experiments demonstrate significant improvements in hit identification sensitivity and specificity over the B-score method, which is highly sensitive to threshold choice. The framework is implemented as an efficient R extension package BHTSpack and is suitable for large scale data sets.


Learning modular structures from network data and node variables

arXiv.org Machine Learning

A standard technique for understanding underlying dependency structures among a set of variables posits a shared conditional probability distribution for the variables measured on individuals within a group. This approach is often referred to as module networks, where individuals are represented by nodes in a network, groups are termed modules, and the focus is on estimating the network structure among modules. However, estimation solely from node-specific variables can lead to spurious dependencies, and unverifiable structural assumptions are often used for regularization. Here, we propose an extended model that leverages direct observations about the network in addition to node-specific variables. By integrating complementary data types, we avoid the need for structural assumptions. We illustrate theoretical and practical significance of the model and develop a reversible-jump MCMC learning procedure for learning modules and model parameters. We demonstrate the method accuracy in predicting modular structures from synthetic data and capability to learn influence structures in twitter data and regulatory modules in the Mycobacterium tuberculosis gene regulatory network.


Decision Support System for Renal Transplantation

arXiv.org Machine Learning

The burgeoning need for kidney transplantation mandates immediate attention. Mismatch of deceased donor-recipient kidney leads to post-transplant death. To ensure ideal kidney donor-recipient match and minimize post-transplant deaths, the paper develops a prediction model that identifies factors that determine the probability of success of renal transplantation, that is, if the kidney procured from the deceased donor can be transplanted or discarded. The paper conducts a study enveloping data for 584 imported kidneys collected from 12 transplant centers associated with an organ procurement organization located in New York City, NY. The predicting model yielding best performance measures can be beneficial to the healthcare industry. Transplant centers and organ procurement organizations can take advantage of the prediction model to efficiently predict the outcome of kidney transplantation. Consequently, it will reduce the mortality rate caused by mismatching of donor-recipient kidney transplantation during the surgery.


A tree augmented naive Bayesian network experiment for breast cancer prediction

arXiv.org Machine Learning

In order to investigate the breast cancer prediction problem on the aging population with the grades of DCIS, we conduct a tree augmented naive Bayesian network experiment trained and tested on a large clinical dataset including consecutive diagnostic mammography examinations, consequent biopsy outcomes and related cancer registry records in the population of women across all ages. The aggregated results of our ten-fold cross validation method recommend a biopsy threshold higher than 2% for the aging population.