Accuracy
Molecular Graph Convolutions: Moving Beyond Fingerprints
Kearnes, Steven, McCloskey, Kevin, Berndl, Marc, Pande, Vijay, Riley, Patrick
Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular "graph convolutions", a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph---atoms, bonds, distances, etc.---which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.
Application of multiview techniques to NHANES dataset
Research into disease-related health variables typically involves choosing health variables and conditions, and using statistical methods to study the strength of association of the variables with the condition [9]. These are then used to confirm known or suspected relationships between the behavioural/health factors or disease conditions. Further information about health status may be gleaned by considering different aspects of an individual's data and investigating possible relationships between the variables. Representations that capture these relationships can be useful in predicting the presence or risk level of medical conditions. The National Health and Nutrition Examination Survey (NHANES) dataset provides data on health measurements, taken from survey participants, comprising different categories including demographics, laboratory tests and physical measurements.
Random forest explained in simple terms - Listen Data
If omitted, randomForest will run in unsupervised mode. Arguments: mtry, the number of variables selected at each split (default: sqrt(number of variables) for classification); ntree, the number of trees to grow (default: 500); nodesize, the minimum size of terminal nodes (default: 1). Step III: Find the number of trees at which the out-of-bag (OOB) error rate stabilizes and reaches its minimum. Step IV: Find the optimal number of variables selected at each split. Select the mtry value with the minimum OOB error. It returns the optimal value of mtry (a parameter used in the randomForest package).
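The search for mtry by out-of-bag error can be sketched in Python. This is only an analogy under stated assumptions: it uses scikit-learn rather than the R randomForest package, treating max_features as the counterpart of mtry and n_estimators as the counterpart of ntree. The synthetic dataset, the candidate values and the helper name oob_error_for_mtry are illustrative and do not come from the article.

```python
from math import isqrt

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration; not from the article).
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=6, random_state=0
)


def oob_error_for_mtry(mtry, n_trees=200):
    """Fit a forest using `mtry` features per split and return its OOB error."""
    forest = RandomForestClassifier(
        n_estimators=n_trees,   # rough counterpart of ntree
        max_features=mtry,      # rough counterpart of mtry
        oob_score=True,         # score each sample with trees that did not see it
        random_state=0,
    )
    forest.fit(X, y)
    return 1.0 - forest.oob_score_


# floor(sqrt(p)) mirrors the classification default mentioned above.
default_mtry = isqrt(X.shape[1])
candidates = [2, default_mtry, 8, 16]

# Evaluate each candidate and keep the one with the lowest OOB error.
oob_errors = {m: oob_error_for_mtry(m) for m in candidates}
best_mtry = min(oob_errors, key=oob_errors.get)
print(oob_errors, best_mtry)
```

The same loop, run over increasing n_trees with mtry held fixed, would give a rough picture of where the OOB error stops improving.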
Why do we fall for false positives even though they're common?
Last month, the drinking water in a Colorado town was declared unsafe, because it had been contaminated by an ingredient from cannabis. It took two days to discover that this was not the case: a water test had turned up a false positive result. In fact, false positives are widespread in our everyday lives, and we seem to have an innate inability to get to grips with them. The fuss in Hugo, Colorado (a state where cannabis use is now legal) began when a county employee administering a test for drug use decided to use the same kind of test on tap water, rather than saliva, in an attempt to rule out a false positive. When the water tested positive too, it was assumed the test kit was a dud.
Does quantification without adjustments work?
Classification is the task of predicting the class labels of objects based on the observation of their features. In contrast, quantification has been defined as the task of determining the prevalences of the different sorts of class labels in a target dataset. The simplest approach to quantification is Classify & Count, where a classifier is optimised for classification on a training set and applied to the target dataset for the prediction of class labels. In the case of binary quantification, the number of predicted positive labels is then used as an estimate of the prevalence of the positive class in the target dataset. Since the performance of Classify & Count for quantification is known to be inferior, its results are typically subject to adjustments. However, some researchers have recently suggested that Classify & Count might actually work without adjustments if it is based on a classifier that was specifically trained for quantification. We discuss the theoretical foundation for this claim and explore its potential and limitations with a numerical example based on the binormal model with equal variances. In order to identify an optimal quantifier in the binormal setting, we introduce the concept of local Bayes optimality. As a side remark, we present a complete proof of a theorem by Ye et al. (2012).
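A minimal sketch of Classify & Count in a binormal setting with equal variances may help make the bias concrete. The means, the threshold, the sample size and the target prevalence are all illustrative assumptions, and the correction shown (rescaling by the classifier's true and false positive rates) is one common adjustment, not necessarily the one the paper studies.

```python
import random
from statistics import NormalDist

# Binormal model with equal variances (parameter values are assumptions).
NEG = NormalDist(mu=0.0, sigma=1.0)
POS = NormalDist(mu=2.0, sigma=1.0)
THRESHOLD = 1.0  # Bayes-optimal cut-off when the training prevalence is 0.5


def sample_scores(n, prevalence, rng):
    """Draw n feature values from a target set with the given positive prevalence."""
    return [
        rng.gauss(POS.mean, POS.stdev)
        if rng.random() < prevalence
        else rng.gauss(NEG.mean, NEG.stdev)
        for _ in range(n)
    ]


def classify_and_count(scores):
    """Estimate prevalence as the fraction of predicted positives."""
    return sum(s > THRESHOLD for s in scores) / len(scores)


def adjusted_count(scores):
    """Rescale Classify & Count by the classifier's known TPR and FPR."""
    tpr = 1.0 - POS.cdf(THRESHOLD)
    fpr = 1.0 - NEG.cdf(THRESHOLD)
    estimate = (classify_and_count(scores) - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # clip to a valid prevalence


rng = random.Random(0)
target = sample_scores(20000, prevalence=0.2, rng=rng)
cc = classify_and_count(target)
acc = adjusted_count(target)
print(f"true prevalence 0.20, Classify & Count {cc:.3f}, adjusted {acc:.3f}")
```

Because the threshold was tuned for a balanced training set, the unadjusted count overestimates a target prevalence of 0.2, while the rescaled estimate lands much closer to it.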
Kernel Ridge Regression via Partitioning
Tandon, Rashish, Si, Si, Ravikumar, Pradeep, Dhillon, Inderjit
In this paper, we investigate a divide and conquer approach to Kernel Ridge Regression (KRR). Given n samples, the division step involves separating the points based on some underlying disjoint partition of the input space (possibly via clustering), and then computing a KRR estimate for each partition. The conquering step is simple: for each partition, we only consider its own local estimate for prediction. We establish conditions under which we can give generalization bounds for this estimator, as well as achieve optimal minimax rates. We also show that the approximation error component of the generalization error is smaller than when a single KRR estimate is fit on the data, thus providing both statistical and computational advantages over a single KRR estimate over the entire data (or an averaging over random partitions as in other recent work, [30]). Lastly, we provide experimental validation for our proposed estimator and our assumptions.
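The divide and conquer steps can be sketched with NumPy as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the inputs are one-dimensional, the partition is three fixed intervals rather than a clustering, the kernel is Gaussian, and the names rbf_kernel, fit_krr, cell_of and predict, along with the values of gamma and lam, are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, gamma=10.0):
    """Gaussian kernel matrix between 1-D point sets a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)


def fit_krr(x, y, lam=1e-3):
    """Solve (K + lam * n * I) alpha = y for the points of one partition."""
    n = len(x)
    return np.linalg.solve(rbf_kernel(x, x) + lam * n * np.eye(n), y)


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 600)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(600)

# Division step: a disjoint partition of the input space (fixed intervals here).
edges = np.linspace(0.0, 1.0, 4)  # three cells


def cell_of(points):
    """Index of the cell containing each point."""
    idx = np.searchsorted(edges, points, side="right") - 1
    return np.clip(idx, 0, len(edges) - 2)


cells = cell_of(x)
models = {}
for c in range(len(edges) - 1):
    mask = cells == c
    models[c] = (x[mask], fit_krr(x[mask], y[mask]))


def predict(points):
    """Conquer step: each test point uses only its own cell's local estimate."""
    out = np.empty(len(points))
    point_cells = cell_of(points)
    for c, (xc, alpha) in models.items():
        mask = point_cells == c
        out[mask] = rbf_kernel(points[mask], xc) @ alpha
    return out


x_test = np.linspace(0.0, 1.0, 200)
rmse = np.sqrt(np.mean((predict(x_test) - np.sin(2 * np.pi * x_test)) ** 2))
print(f"RMSE of partitioned KRR against the noiseless target: {rmse:.3f}")
```

Each local solve involves roughly a third of the samples, which is where the computational saving over a single KRR fit on all the data comes from.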
Classification with Asymmetric Label Noise: Consistency and Maximal Denoising
Blanchard, Gilles, Flaska, Marek, Handy, Gregory, Pozzi, Sara, Scott, Clayton
In many real-world classification problems, the labels of training examples are randomly corrupted. Most previous theoretical work on classification with label noise assumes that the two classes are separable, that the label noise is independent of the true class label, or that the noise proportions for each class are known. In this work, we give conditions that are necessary and sufficient for the true class-conditional distributions to be identifiable. These conditions are weaker than those analyzed previously, and allow for the classes to be nonseparable and the noise levels to be asymmetric and unknown. The conditions essentially state that a majority of the observed labels are correct and that the true class-conditional distributions are "mutually irreducible," a concept we introduce that limits the similarity of the two distributions. For any label noise problem, there is a unique pair of true class-conditional distributions satisfying the proposed conditions, and we argue that this pair corresponds in a certain sense to maximal denoising of the observed distributions. Our results are facilitated by a connection to "mixture proportion estimation," which is the problem of estimating the maximal proportion of one distribution that is present in another. We establish a novel rate of convergence result for mixture proportion estimation, and apply this to obtain consistency of a discrimination rule based on surrogate loss minimization. Experimental results on benchmark data and a nuclear particle classification problem demonstrate the efficacy of our approach.
Multiple Instance Dictionary Learning using Functions of Multiple Instances
A multiple instance dictionary learning method using functions of multiple instances (DL-FUMI) is proposed to address target detection and two-class classification problems with inaccurate training labels. Given inaccurate training labels, DL-FUMI learns a set of target dictionary atoms that describe the most distinctive and representative features of the true positive class as well as a set of nontarget dictionary atoms that account for the shared information found in both the positive and negative instances. Experimental results show that the estimated target dictionary atoms found by DL-FUMI are more representative prototypes and identify better discriminative features of the true positive class than existing methods in the literature. DL-FUMI is shown to have significantly better performance on several target detection and classification problems as compared to other multiple instance learning (MIL) dictionary learning algorithms on a variety of MIL problems.
Assessing Functional Neural Connectivity as an Indicator of Cognitive Performance
Helfer, Brian S., Williamson, James R., Miller, Benjamin A., Perricone, Joseph, Quatieri, Thomas F.
Studies in recent years have demonstrated that neural organization and structure impact an individual's ability to perform a given task. Specifically, individuals with greater neural efficiency have been shown to outperform those with less organized functional structure. In this work, we compare the predictive ability of properties of neural connectivity on a working memory task. We provide two novel approaches for characterizing functional network connectivity from electroencephalography (EEG), and compare these features to the average power across frequency bands in EEG channels. Our first novel approach represents functional connectivity structure through the distribution of eigenvalues making up channel coherence matrices in multiple frequency bands. Our second approach creates a connectivity network at each frequency band, and assesses variability in average path lengths of connected components and degree across the network. Failures in digit and sentence recall on single trials are detected using a Gaussian classifier for each feature set, at each frequency band. The classifier results are then fused across frequency bands, with the resulting detection performance summarized using the area under the receiver operating characteristic curve (AUC) statistic.
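The final evaluation step, fusing per-band detector scores and summarizing them with AUC, can be sketched as below. The scores are toy values made up for illustration, the band names are placeholders, and simple averaging is only one possible fusion rule; the abstract does not say which rule the authors use.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fuse(per_band_scores):
    """Average detector scores across bands for each trial (assumed rule)."""
    return [sum(trial) / len(trial) for trial in zip(*per_band_scores)]


# Toy data: 1 marks a recall failure, 0 a successful trial.
labels = [1, 1, 1, 0, 0, 0]
band_scores = {
    "theta": [0.9, 0.4, 0.7, 0.5, 0.2, 0.1],
    "alpha": [0.6, 0.8, 0.3, 0.4, 0.3, 0.2],
}

for band, scores in band_scores.items():
    print(band, round(auc(scores, labels), 3))

fused = fuse(band_scores.values())
print("fused", round(auc(fused, labels), 3))
```

In this toy case each band misranks a different trial, so the averaged score separates the two classes better than either band alone.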
Time-Sensitive Bayesian Information Aggregation for Crowdsourcing Systems
Venanzi, Matteo, Guiver, John, Kohli, Pushmeet, Jennings, Nicholas R.
Many aspects of the design of efficient crowdsourcing processes, such as defining workers' bonuses, fair prices and time limits of the tasks, involve knowledge of the likely duration of the task at hand. In this work we introduce a new time-sensitive Bayesian aggregation method that simultaneously estimates a task's duration and obtains reliable aggregations of crowdsourced judgments. Our method, called BCCTime, uses latent variables to represent the uncertainty about the workers' completion time, the tasks' duration and the workers' accuracy. To relate the quality of a judgment to the time a worker spends on a task, our model assumes that each task is completed within a latent time window within which all workers with a propensity to genuinely attempt the labelling task (i.e., no spammers) are expected to submit their judgments. In contrast, workers with a lower propensity to valid labelling, such as spammers, bots or lazy labellers, are assumed to perform tasks considerably faster or slower than the time required by normal workers. Specifically, we use efficient message-passing Bayesian inference to learn approximate posterior probabilities of (i) the confusion matrix of each worker, (ii) the propensity to valid labelling of each worker, (iii) the unbiased duration of each task and (iv) the true label of each task. Using two real-world public datasets for entity linking tasks, we show that BCCTime produces up to 11% more accurate classifications and up to 100% more informative estimates of a task's duration compared to state-of-the-art methods.