Accuracy
Blood test shows promise for detecting the deadliest cancers early
A blood test developed and checked using blood samples from 4000 people can accurately detect more than 50 cancer types, often before any symptoms appear. It was most accurate at identifying 12 especially dangerous types, including pancreatic cancers that are usually diagnosed only at a very late stage. Many groups around the world are trying to develop blood tests for cancer, often referred to as "liquid biopsies". Michael Seiden at US Oncology, a company involved in cancer care, and his team explored several ways of testing for cancer based on sequencing the DNA that dying cells release into the bloodstream. The team found that looking at methylation patterns at around a million sites was the most promising.
Improving Emergency Department ESI Acuity Assignment Using Machine Learning and Clinical Natural Language Processing
Ivanov, Oleksandr, Wolf, Lisa, Brecher, Deena, Masek, Kevin, Lewis, Erica, Liu, Stephen, Dunne, Robert B, Klauer, Kevin, Montgomery, Kyla, Andrieiev, Yurii, McLaughlin, Moss, Reilly, Christian
Effective triage is critical to mitigating the effect of increased volume: it accurately determines patient acuity and need for resources, and establishes effective acuity-based patient prioritization. The purpose of this retrospective study was to determine whether historical EHR data can be extracted and synthesized with clinical natural language processing (C-NLP) and the latest ML algorithms (KATE) to produce highly accurate ESI predictive models. An ML model (KATE) for the triage process was developed using 166,175 patient encounters from two participating hospitals. The model was then tested against a gold set derived from a random sample of triage encounters at the study sites, for which study clinicians recorded the correct acuity assignments using the Emergency Severity Index (ESI) standard as a guide. At the two study sites, KATE predicted accurate ESI acuity assignments 75.9% of the time, compared with 59.8% for nurses and 75.3% for the average individual study clinician. In relative terms, KATE's accuracy was 26.9% higher than the average nurse accuracy (p-value < 0.0001). On the boundary between ESI 2 and ESI 3 acuity assignments, which relates to the risk of decompensation, KATE's accuracy of 80% was 93.2% higher in relative terms than the triage nurses' 41.4% (p-value < 0.0001). KATE provides a triage acuity assignment substantially more accurate than the triage nurses in this study sample. KATE operates independently of contextual factors, is unaffected by the external pressures that can cause undertriage, and may mitigate the racial and social biases that can negatively affect the accuracy of triage assignment. Future research should focus on the impact of KATE providing feedback to triage nurses in real time and on KATE's impact on mortality and morbidity, ED throughput, resource optimization, and nursing outcomes.
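The "26.9% higher" and "93.2% higher" figures in this abstract appear to be relative improvements over the nurse baseline, not percentage-point differences. The short check below recomputes both from the accuracies quoted in the abstract; the function name `relative_improvement` is an illustrative choice and is not taken from the study.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return how much higher `new` is than `baseline`, as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100.0


# Accuracies (in percent) as quoted in the abstract.
kate_overall, nurse_overall = 75.9, 59.8
kate_boundary, nurse_boundary = 80.0, 41.4

overall = relative_improvement(kate_overall, nurse_overall)
boundary = relative_improvement(kate_boundary, nurse_boundary)

print(f"Overall: {overall:.1f}% relative, "
      f"{kate_overall - nurse_overall:.1f} points absolute")
print(f"ESI 2/3 boundary: {boundary:.1f}% relative, "
      f"{kate_boundary - nurse_boundary:.1f} points absolute")
```

The absolute gaps are 16.1 and 38.6 percentage points; dividing each by the nurse accuracy gives the relative figures reported in the abstract.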
DCMD: Distance-based Classification Using Mixture Distributions on Microbiome Data
Shestopaloff, Konstantin, Dong, Mei, Gao, Fan, Xu, Wei
Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance when using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture; these distributions are then used as inputs for distance-based classification. The method is implemented in k-means and k-nearest neighbours frameworks, and we identify two distance metrics that produce optimal results. The model's performance is assessed using simulations, and the model is applied to a human microbiome study, with results compared against a number of existing machine learning and distance-based approaches. The proposed method is competitive when compared to the machine learning approaches and shows a clear improvement over commonly used distance-based classifiers. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data.
AI program could check blood for signs of lung cancer
Scientists have developed an artificial intelligence program that can screen people for lung cancer by analysing their blood for DNA mutations that drive the disease. The software is experimental and needs to be verified in a clinical trial, but doctors are hopeful that if it proves its worth at scale, it will boost lung cancer screening rates by making the procedure as simple as a routine blood test. The program works by examining free-floating DNA that circulates in the blood. The majority of this genetic detritus enters the bloodstream when harmless cells in the body break down and spill their molecular innards, but tumours also shed DNA as they form and grow larger. The UK has no national lung cancer screening programme, but is exploring an approach adopted in the US where people who are at high risk, such as older smokers and former smokers, can have low-dose chest X-rays to check their lungs for tumours.
A Collective Learning Framework to Boost GNN Expressiveness
Hang, Mengyue, Neville, Jennifer, Ribeiro, Bruno
Graph Neural Networks (GNNs) have recently been used for node and graph classification tasks with great success, but GNNs model dependencies among the attributes of nearby neighboring nodes rather than dependencies among observed node labels. In this work, we consider the task of inductive node classification using GNNs in supervised and semi-supervised settings, with the goal of incorporating label dependencies. Because current GNNs are not universal (i.e., most-expressive) graph representations, we propose a general collective learning approach to increase the representation power of any existing GNN. Our framework combines ideas from collective classification with self-supervised learning, and uses a Monte Carlo approach to sampling embeddings for inductive learning across graphs. We evaluate performance on five real-world network datasets and demonstrate consistent, significant improvement in node classification accuracy, for a variety of state-of-the-art GNNs.
Distributed Kernel Ridge Regression with Communications
Lin, Shao-Bo, Wang, Di, Zhou, Ding-Xuan
This paper focuses on generalization performance analysis for distributed algorithms in the framework of learning theory. Taking distributed kernel ridge regression (DKRR) as an example, we succeed in deriving its optimal learning rates in expectation and providing theoretically optimal ranges of the number of local processors. Because of the gap between theory and experiments, we also deduce optimal learning rates for DKRR in probability, which better reflect the generalization performance and limitations of DKRR. Furthermore, we propose a communication strategy to improve the learning performance of DKRR and demonstrate the power of communications in DKRR via both theoretical assessments and numerical experiments.
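For readers unfamiliar with DKRR, the following is a minimal sketch of the basic divide-and-conquer form only: each local processor fits kernel ridge regression on its own data block, and the global estimate is the block-size-weighted average of the local predictors. It does not implement the communication strategy the abstract proposes. The Gaussian kernel, its width, the regularization value `lam`, the synthetic data, and all function names are assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, width=0.5):
    """Gaussian (RBF) kernel matrix between two 1-D sample arrays."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2 * width**2))


def fit_krr(x, y, lam):
    """Local KRR: solve (K + lam * n * I) alpha = y on one data block."""
    n = len(x)
    K = gaussian_kernel(x, x)
    return np.linalg.solve(K + lam * n * np.eye(n), y)


def dkrr_predict(blocks, x_new, lam=1e-3):
    """Average the local KRR predictions, weighted by block size."""
    total = sum(len(x) for x, _ in blocks)
    pred = np.zeros(len(x_new))
    for x, y in blocks:
        alpha = fit_krr(x, y, lam)
        pred += (len(x) / total) * (gaussian_kernel(x_new, x) @ alpha)
    return pred


# Hypothetical synthetic regression problem: noisy samples of sin(pi * x).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 400)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(400)

# Split the data across 4 simulated local processors.
blocks = [(x[i::4], y[i::4]) for i in range(4)]

x_test = np.linspace(-0.9, 0.9, 50)
pred = dkrr_predict(blocks, x_test)
rmse = float(np.sqrt(np.mean((pred - np.sin(np.pi * x_test)) ** 2)))
print(f"RMSE against the noise-free target: {rmse:.3f}")
```

Each local solve costs on the order of the cube of the block size rather than of the full sample size, which is the usual motivation for distributing the computation. The trade-off is that too many blocks leave each local estimator with too little data, which is why the number of local processors matters.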
Adversarial System Variant Approximation to Quantify Process Model Generalization
Theis, Julian, Darabi, Houshang
In process mining, process models are extracted from event logs using process discovery algorithms and are commonly assessed using multiple quality dimensions. While the metrics that measure the relationship of an extracted process model to its event log are well-studied, quantifying the degree to which a process model can describe the unobserved behavior of its underlying system has received little attention in the literature. In this paper, a novel deep learning-based methodology called Adversarial System Variant Approximation (AVATAR) is proposed to overcome this issue. Sequence Generative Adversarial Networks are trained on the variants contained in an event log in order to approximate the underlying variant distribution of the system behavior. Unobserved realistic variants are sampled either directly from the Sequence Generative Adversarial Network or by leveraging the Metropolis-Hastings algorithm. The degree to which a process model relates to its underlying unknown system behavior is then quantified based on the realistic observed and estimated unobserved variants using established process model quality metrics. Significant performance improvements in revealing realistic unobserved variants are demonstrated in a controlled experiment on 15 ground truth systems. Additionally, the proposed methodology is experimentally tested and evaluated to quantify the generalization of 60 discovered process models with respect to their systems.
MIM-Based Generative Adversarial Networks and Its Application on Anomaly Detection
In Generative Adversarial Networks (GANs), the information metric used to discriminate generated data from real data is key to generation efficiency, which plays an important role in GAN-based applications, especially anomaly detection. In the original GAN, the information metric based on Kullback-Leibler (KL) divergence has limitations in rare-event generation and in the training performance of the adversarial networks. It is therefore worthwhile to investigate the metrics used in GANs, both to improve generation ability and to bring gains in the training process. In this paper, we adopt the exponential form, taken from the Message Importance Measure (MIM), to replace the logarithmic form of the original GAN. This approach, named MIM-based GAN, performs strongly in both training and rare-event generation. Specifically, we first discuss the characteristics of the training process in this approach. We then analyze its theoretical advantages in generating rare events. Finally, simulations on the MNIST and ODDS datasets show that the MIM-based GAN achieves state-of-the-art anomaly detection performance compared with some classical GANs.
Boosting Ridge Regression for High Dimensional Data Classification
Ridge regression is a well-established regression estimator that can conveniently be adapted for classification problems. One compelling reason is probably that ridge regression admits a closed-form solution, which simplifies the training phase. However, in high-dimensional problems, the closed-form solution, which involves inverting the regularised covariance matrix, is rather expensive to compute. This high computational demand also makes it difficult to construct an ensemble of ridge regressors. In this paper, we consider learning an ensemble of ridge regressors where each regressor is trained in its own randomly projected subspace. The subspace regressors are later combined via an adaptive boosting methodology. Experiments on five high-dimensional classification problems demonstrated the effectiveness of the proposed method in terms of learning time and, in some cases, improved predictive performance.
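The subspace idea can be sketched as follows. This is an illustrative simplification, not the paper's method: each ridge regressor is trained in its own random Gaussian projection of the features, but the members are combined by plain score averaging instead of the adaptive boosting scheme the abstract describes. The projection dimension `k`, the penalty `lam`, the synthetic data, and all function names are assumptions.

```python
import numpy as np


def fit_ridge(Z, y, lam):
    """Closed-form ridge solution w = (Z^T Z + lam * I)^{-1} Z^T y."""
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)


def fit_subspace_ensemble(X, y, n_members=10, k=20, lam=1.0, seed=0):
    """Train each ridge regressor on its own random k-dimensional projection of X."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
        members.append((R, fit_ridge(X @ R, y, lam)))
    return members


def predict(members, X):
    """Average the member scores and threshold at zero (labels in {-1, +1})."""
    scores = np.mean([X @ R @ w for R, w in members], axis=0)
    return np.where(scores >= 0, 1, -1)


# Hypothetical synthetic data: 200 samples, 500 features, 10 informative.
rng = np.random.default_rng(1)
n, d = 200, 500
y = rng.choice([-1, 1], size=n)
X = rng.standard_normal((n, d))
X[:, :10] += 1.5 * y[:, None]  # shift the informative features by class

members = fit_subspace_ensemble(X, y)
train_acc = float(np.mean(predict(members, X) == y))
print(f"Training accuracy of the averaged ensemble: {train_acc:.2f}")
```

The point of the projection is cost: each member solves a k-by-k linear system (here 20 by 20) instead of a d-by-d one (here 500 by 500), so an ensemble of many small solves can still be cheaper than a single full-dimensional ridge fit.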
Logistic Regression examples in Python & R
Every machine learning algorithm takes its own approach, but few are as easy to interpret as logistic regression. It is widely used in the credit and risk industry because it gives an intuitive way to predict the chance of default and other risk cases. Most algorithms are hard to break down because of their black-box nature and hard-to-find parameters, but logistic regression stands out for its interpretability. So it is time to break down the entire algorithm and draw some inferences.
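As a starting point, here is a minimal from-scratch sketch in Python of a one-feature logistic regression fitted by batch gradient descent on the log loss. The toy data (a debt-to-income ratio and a default flag), the learning rate, and the function names are all hypothetical and chosen only to illustrate the credit-risk use case mentioned above.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: debt-to-income ratio and whether the borrower defaulted.
ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(ratios, defaulted)
p_low = sigmoid(w * 0.15 + b)
p_high = sigmoid(w * 0.75 + b)
print(f"w={w:.2f}, b={b:.2f}")
print(f"P(default | ratio=0.15)={p_low:.2f}, P(default | ratio=0.75)={p_high:.2f}")
```

The interpretability comes from the coefficients: a positive `w` means the log-odds of default rise linearly with the ratio, and `exp(w)` is the factor by which the odds multiply for a one-unit increase in the feature.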