 Accuracy


[P] The unreasonable usefulness of deep learning in building and cleaning medical image datasets • r/MachineLearning

@machinelearnbot

One thing I find weird is that we have lots of discussion of deep learning in complex detection and recognition tasks, but very few people talk about how useful deep learning can be for simple but time-consuming image data processing tasks, particularly in medical research. In this post I spend a bit of time cleaning up the CXR14 dataset, and in 4 hours find 430 images with various problems that shouldn't be in the dataset (a CSV identifying these images is included in the post). While the prevalence of these problems is super low (50/100,000), since the visual challenge is very easy the models can achieve absurdly low false positive rates. I even get an AUROC of 1.0 in a 2000 image validation set on one task :) As a result, cleaning this dataset to remove 3 different problems didn't take me weeks of poring over each image, but under a day. Certainly nothing in the post is technically groundbreaking, but it is hopefully a prompt to consider deep learning when you are doing time-consuming processing.
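For readers unfamiliar with the metric, here is a minimal sketch of what an AUROC of 1.0 means: every problem image gets a higher score than every clean image. The function name `auroc` and the toy scores are assumptions made for illustration and are not taken from the post.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (problem image) or 0 (clean image).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores: the single positive is ranked above every negative.
perfect = auroc([0, 0, 0, 1], [0.1, 0.2, 0.3, 0.9])

# Made-up scores: one negative (0.5) outranks one positive (0.4).
imperfect = auroc([0, 1, 0, 1], [0.1, 0.4, 0.5, 0.9])

print(perfect, imperfect)
```

With a rare positive class, a perfect ranking like this is what makes very low false positive rates achievable at a suitable threshold.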


Joint Bootstrapping Machines for High Confidence Relation Extraction

arXiv.org Artificial Intelligence

Semi-supervised bootstrapping techniques for relationship extraction from text iteratively expand a set of initial seed instances. Due to the lack of labeled data, a key challenge in bootstrapping is semantic drift: if a false positive instance is added during an iteration, then all following iterations are contaminated. We introduce BREX, a new bootstrapping method that protects against such contamination by highly effective confidence assessment. This is achieved by using entity and template seeds jointly (as opposed to just one as in previous work), by expanding entities and templates in parallel and in a mutually constraining fashion in each iteration and by introducing higher-quality similarity measures for templates. Experimental results show that BREX achieves an F1 that is 0.13 (0.87 vs. 0.74) better than the state of the art for four relationships.


Has your machine really learned something? Snap quiz time

#artificialintelligence

Machine learning (ML) is all about getting machines to learn, but how do we know how well they are doing? Suppose we have existing data about people who buy books from our website: age, amount spent, preferred author and so on. These columns of data are known, in ML terms, as the "Predictors". For some of the customers we also know their gender (the "Response"); for others we don't, but we want to know it for all customers. We start with the customers where we do know the gender and we feed both the Predictor and Response data into a machine learning algorithm and get it to build a model that can (hopefully) forecast the Response from the Predictors.
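The workflow described here can be sketched in a few lines. This is only an illustration under stated assumptions: the customer rows are made up, the predictors are reduced to (age, amount spent), and a nearest-centroid rule stands in for whatever algorithm the article has in mind. Holding back some customers whose Response is known gives one simple answer to "how well is it doing?".

```python
def fit_centroids(rows):
    """Average the predictor values for each response label."""
    sums = {}
    for predictors, response in rows:
        total, count = sums.get(response, ([0.0] * len(predictors), 0))
        sums[response] = ([a + b for a, b in zip(total, predictors)], count + 1)
    return {label: [s / count for s in total] for label, (total, count) in sums.items()}


def predict(centroids, predictors):
    """Return the label whose centroid is closest to the given predictors."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(predictors, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical customers with a known Response: ((age, amount_spent), gender).
known = [
    ((25, 40.0), "F"), ((31, 55.0), "F"), ((28, 48.0), "F"),
    ((52, 120.0), "M"), ((47, 110.0), "M"), ((58, 135.0), "M"),
    ((27, 45.0), "F"), ((50, 125.0), "M"),
]

# Train on the first six rows, hold out the last two to check the model.
train, held_out = known[:6], known[6:]
centroids = fit_centroids(train)
accuracy = sum(predict(centroids, p) == r for p, r in held_out) / len(held_out)
print(accuracy)

# A customer whose Response is unknown can now be given a forecast.
print(predict(centroids, (26, 42.0)))
```

Accuracy on the held-out rows, rather than on the training rows, is the number that indicates how the model may behave on customers it has never seen.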


A Guide to Constraining Effective Field Theories with Machine Learning

arXiv.org Machine Learning

We develop, discuss, and compare several inference techniques to constrain theory parameters in collider experiments. By harnessing the latent-space structure of particle physics processes, we extract extra information from the simulator. This augmented data can be used to train neural networks that precisely estimate the likelihood ratio. The new methods scale well to many observables and high-dimensional parameter spaces, do not require any approximations of the parton shower and detector response, and can be evaluated in microseconds. Using weak-boson-fusion Higgs production as an example process, we compare the performance of several techniques. The best results are found for likelihood ratio estimators trained with extra information about the score, the gradient of the log likelihood function with respect to the theory parameters. The score also provides sufficient statistics that contain all the information needed for inference in the neighborhood of the Standard Model. These methods enable us to put significantly stronger bounds on effective dimension-six operators than the traditional approach based on histograms. They also outperform generic machine learning methods that do not make use of the particle physics structure, demonstrating their potential to substantially improve the new physics reach of the LHC legacy results.


Interpreting weight maps in terms of cognitive or clinical neuroscience: nonsense?

arXiv.org Machine Learning

Linear machine learning models can be seen as providing two outputs: predictions and weight maps. The latter shows the relative contribution of the individual features to the model and has been heavily used in the neuroimaging community to infer conclusions about brain structure/function. There has however been a recent debate on whether weight maps can provide information about the neural signals leading to a significant classification/regression model [1]-[3]. The authors of [1] indeed suggest that weight maps provide a poor recovery of the input neural signal and lead to false positives. They further demonstrate that the amplitude of the weight does not reflect the amplitude of the signal difference in a feature. However, their examples are specific cases with low signal-to-noise ratio (SNR). Here, we investigate the recovery of two widespread techniques, namely SVM [4] and sparse MKL [5] when varying the SNR, as well as the distribution of simulated neural signals.


Scalable Angular Discriminative Deep Metric Learning for Face Recognition

arXiv.org Artificial Intelligence

With the development of deep learning, Deep Metric Learning (DML) has achieved great improvements in face recognition. Specifically, the widely used softmax loss in the training process often brings large intra-class variations, and feature normalization is only exploited in the testing process to compute the pair similarities. To bridge the gap, we impose the intra-class cosine similarity between the features and weight vectors in softmax loss larger than a margin in the training step, and extend it from four aspects. First, we explore the effect of a hard sample mining strategy. To alleviate the human labor of adjusting the margin hyper-parameter, a self-adaptive margin updating strategy is proposed. Then, a normalized version is given to take full advantage of the cosine similarity constraint. Furthermore, we enhance the former constraint to force the intra-class cosine similarity larger than the mean inter-class cosine similarity with a margin in the exponential feature projection space. Extensive experiments on Labeled Faces in the Wild (LFW), YouTube Faces (YTF) and IARPA Janus Benchmark A (IJB-A) datasets demonstrate that the proposed methods outperform the mainstream DML methods and approach the state-of-the-art performance.


OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction

arXiv.org Artificial Intelligence

Motivation: Ontologies are widely used in biology for data annotation, integration, and analysis. In addition to formally structured axioms, ontologies contain meta-data in the form of annotation axioms which provide valuable pieces of information that characterize ontology classes. Annotations commonly used in ontologies include class labels, descriptions, or synonyms. Despite being a rich source of semantic information, the ontology meta-data are generally unexploited by ontology-based analysis methods such as semantic similarity measures. Results: We propose a novel method, OPA2Vec, to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data. We apply a Word2Vec model that has been pre-trained on PubMed abstracts to produce feature vectors from our collected data. We validate our method in two different ways: first, we use the obtained vector representations of proteins as a similarity measure to predict protein-protein interaction (PPI) on two different datasets. Second, we evaluate our method on predicting gene-disease associations based on phenotype similarity by generating vector representations of genes and diseases using a phenotype ontology, and applying the obtained vectors to predict gene-disease associations. These two experiments are just an illustration of the possible applications of our method. OPA2Vec can be used to produce vector representations of any biomedical entity given any type of biomedical ontology. Availability: https://github.com/bio-ontology-research-group/opa2vec Contact: robert.hoehndorf@kaust.edu.sa and xin.gao@kaust.edu.sa.


Credit risk prediction in an imbalanced social lending environment

arXiv.org Machine Learning

Credit risk prediction is an effective way of evaluating whether a potential borrower will repay a loan, particularly in peer-to-peer lending where class imbalance problems are prevalent. However, few credit risk prediction models for social lending consider imbalanced data and, further, the best resampling technique to use with imbalanced data is still controversial. In an attempt to address these problems, this paper presents an empirical comparison of various combinations of classifiers and resampling techniques within a novel risk assessment methodology that incorporates imbalanced data. The credit predictions from each combination are evaluated with a G-mean measure to avoid bias towards the majority class, which has not been considered in similar studies. The results reveal that combining random forest and random under-sampling may be an effective strategy for calculating the credit risk associated with loan applicants in social lending markets.
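To make the two ideas named above concrete, here is a minimal sketch of the G-mean (the geometric mean of sensitivity and specificity) and of random under-sampling. The helper names and the toy labels are assumptions for illustration; this is not the paper's code or data. The example shows why plain accuracy can mislead on imbalanced data: a model that predicts "repaid" for everyone scores 80% accuracy but a G-mean of zero.

```python
import math
import random


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (label 1 = default)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    sampled = []
    for label, group in by_class.items():
        sampled.extend((row, label) for row in rng.sample(group, target))
    rng.shuffle(sampled)
    return [r for r, _ in sampled], [l for _, l in sampled]


# Toy labels (made up): 8 repaid loans (0) and 2 defaults (1).
y_true = [0] * 8 + [1] * 2
always_repaid = [0] * 10  # a model that ignores the minority class

accuracy = sum(t == p for t, p in zip(y_true, always_repaid)) / len(y_true)
print(accuracy)                        # 0.8
print(g_mean(y_true, always_repaid))   # 0.0

# Under-sampling balances the classes before a classifier is trained.
rows = list(range(10))  # placeholder row ids standing in for loan records
balanced_rows, balanced_labels = random_under_sample(rows, y_true)
print(sorted(balanced_labels))         # [0, 0, 1, 1]
```

In practice the balanced sample would be passed to a classifier such as a random forest, and the G-mean would be computed on an untouched test set.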


Drug Similarity Integration Through Attentive Multi-view Graph Auto-Encoders

arXiv.org Machine Learning

Drug similarity has been studied to support downstream clinical tasks such as inferring novel properties of drugs (e.g. side effects, indications, interactions) from known properties. The growing availability of new types of drug features brings the opportunity of learning a more comprehensive and accurate drug similarity that represents the full spectrum of underlying drug relations. However, it is challenging to integrate this heterogeneous, noisy, nonlinearly related information to learn accurate similarity measures, especially when labels are scarce. Moreover, there is a trade-off between accuracy and interpretability. In this paper, we propose to learn accurate and interpretable similarity measures from multiple types of drug features. In particular, we model the integration using multi-view graph auto-encoders, and add an attentive mechanism to determine the weights for each view with respect to corresponding tasks and features for better interpretability. Our model has a flexible design for both semi-supervised and unsupervised settings. Experimental results demonstrated significant predictive accuracy improvement. Case studies also showed better model capacity (e.g. embed node features) and interpretability.


Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm

#artificialintelligence

Machine learning and data science require more than just throwing data into a Python library and utilizing whatever comes out. Data scientists need to actually understand the data and the processes behind the data to be able to implement a successful system. One key to implementation is knowing when a model might benefit from utilizing bootstrapping methods. These are what are called ensemble models. Some examples of ensemble models are AdaBoost and Stochastic Gradient Boosting. They can help improve algorithm accuracy or improve the robustness of a model.
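As a rough illustration of the bagging half of the title, the sketch below trains many very simple models, each on a bootstrap resample of the training data, and lets them vote. Everything here is an assumption made for the example: the one-feature toy data, the threshold "stump" learner, and the function names. It shows bagging only, not AdaBoost or gradient boosting.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(data):
    """Pick the threshold with the fewest training errors; predict 1 when x > threshold."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in data}):
        err = sum((x > t) != bool(y) for x, y in data)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Made-up training data: (feature, label).
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

rng = random.Random(42)
# An odd number of stumps avoids tied votes in a two-class problem.
thresholds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagged_predict(thresholds, 0.05), bagged_predict(thresholds, 0.95))
```

Each stump sees a slightly different resample, so individual thresholds vary; the vote averages out that variation, which is the robustness benefit the article refers to.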