 Accuracy


Predicting Flights Delay Using Supervised Learning, Logistic Regression

@machinelearnbot

In this post, we'll use a supervised machine learning technique called logistic regression to predict delayed flights. But before we proceed, I'd like to offer condolences to the families of the victims of the Germanwings tragedy. Note: This is a common data set in the machine learning community for testing algorithms and models, given that it is publicly available and sizable. In this blog, we will look at a small sample snapshot (2201 flights in January 2004). In another post, we can explore using Big Data technologies such as Hadoop MapReduce or Spark machine learning libraries to do large-scale predictive analytics and data mining.
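
As a rough illustration of the technique named in this snippet, the sketch below fits a logistic regression by batch gradient descent in plain Python. It is not the post's code or data: the two binary features (`bad_weather`, `evening_departure`), the eight toy rows, and the helper names `fit_logistic` and `predict_proba` are all assumptions made up for the example.

```python
import math


def sigmoid(z):
    # Logistic function: maps any real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    # Batch gradient descent on the average log-loss (hypothetical helper).
    n_features = len(X[0])
    n = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    # Predicted probability that a flight with features x is delayed.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Made-up toy rows: [bad_weather (0/1), evening_departure (0/1)]; label 1 = delayed.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

w, b = fit_logistic(X, y)
p_bad = predict_proba(w, b, [1, 0])   # bad weather, daytime departure
p_good = predict_proba(w, b, [0, 0])  # good weather, daytime departure
print(f"P(delay | bad weather)  = {p_bad:.2f}")
print(f"P(delay | good weather) = {p_good:.2f}")
```

In this toy data the delay label depends only on the weather flag, so the fitted model should assign a higher delay probability to the bad-weather row. A real analysis would use the actual flight records and hold out a test set.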


An Ensemble Boosting Model for Predicting Transfer to the Pediatric Intensive Care Unit

arXiv.org Machine Learning

Our work focuses on the problem of predicting the transfer of pediatric patients from the general ward of a hospital to the pediatric intensive care unit. Using data collected over 5.5 years from the electronic health records of two medical facilities, we develop classifiers based on adaptive boosting and gradient tree boosting. We further combine these learned classifiers into an ensemble model and compare its performance to a modified pediatric early warning score (PEWS) baseline that relies on expert defined guidelines. To gauge model generalizability, we perform an inter-facility evaluation where we train our algorithm on data from one facility and perform evaluation on a hidden test dataset from a separate facility. We show that improvements are witnessed over the PEWS baseline in accuracy (0.77 vs. 0.69), sensitivity (0.80 vs. 0.68), specificity (0.74 vs. 0.70) and AUROC (0.85 vs. 0.73).
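
For readers unfamiliar with the metrics quoted above, the following minimal sketch shows how accuracy, sensitivity, and specificity are computed from confusion-matrix counts. The counts are hypothetical, chosen only so that the ratios are easy to follow; they are not the paper's actual confusion matrix, and the function name `classification_metrics` is an assumption for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    # tp/fn: positive cases (e.g. transfers) caught / missed by the classifier.
    # tn/fp: negative cases correctly cleared / wrongly flagged.
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # true positive rate (recall)
        "specificity": tn / (tn + fp),     # true negative rate
    }


# Hypothetical counts for illustration only (100 positives, 100 negatives).
m = classification_metrics(tp=80, fp=26, tn=74, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

AUROC is not derivable from a single confusion matrix, since it summarizes performance across all decision thresholds, which is why it is reported separately.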


Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis

arXiv.org Machine Learning

The machine learning community adopted the use of null hypothesis significance testing (NHST) in order to ensure the statistical validity of results. Many scientific fields however realized the shortcomings of frequentist reasoning and in the most radical cases even banned its use in publications. We should do the same: just as we have embraced the Bayesian paradigm in the development of new machine learning methods, so we should also use it in the analysis of our own results. We argue for abandonment of NHST by exposing its fallacies and, more importantly, offer better - more sound and useful - alternatives for it.


What's New in MATLAB Data Analytics - MATLAB & Simulink

#artificialintelligence

Use neighborhood component analysis (NCA) to choose features for machine learning models. Manipulate and analyze data that is too big to fit in memory. Perform support vector machine (SVM) and Naive Bayes classification, create bags of decision trees, and fit lasso regression on out-of-memory data. Process big data with tall arrays in parallel on your desktop, MATLAB Distributed Computing Server, and Spark clusters. Manipulate, compare, and store text data efficiently.


Predicting multicellular function through multi-layer tissue networks

arXiv.org Machine Learning

Motivation: Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine. Results: Here we present OhmNet, a hierarchy-aware unsupervised node feature learning approach for multi-layer networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale tissue hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems


Floyd Mayweather vs. Conor McGregor: Ticket Prices, PPV Cost For 2017 Fight

International Business Times

The fight between Floyd Mayweather and Conor McGregor promises to be the biggest bout of 2017, though it won't be cheap to watch. Buying the fight on pay-per-view will cost nearly $100, and no tickets can be had for less than $500. The upcoming Aug. 26 bout has been compared to the fight between Mayweather and Manny Pacquiao on May 2, 2015. It generated a record 4.6 million buys, even though it cost $99.95 in HD. Mayweather-McGregor will also cost $99.95,


Kernel Method for Detecting Higher Order Interactions in multi-view Data: An Application to Imaging, Genetics, and Epigenetics

arXiv.org Machine Learning

In this study, we tested the interaction effect of multimodal datasets using a novel method called the kernel method for detecting higher order interactions among biologically relevant multi-view data. Using a semiparametric method on a reproducing kernel Hilbert space (RKHS), we used a standard mixed-effects linear model and derived a score-based variance component statistic that tests for higher order interactions between multi-view data. The proposed method offers an intangible framework for the identification of higher order interaction effects (e.g., three way interaction) between genetics, brain imaging, and epigenetic data. Extensive numerical simulation studies were first conducted to evaluate the performance of this method. Finally, this method was evaluated using data from the Mind Clinical Imaging Consortium (MCIC) including single nucleotide polymorphism (SNP) data, functional magnetic resonance imaging (fMRI) scans, and deoxyribonucleic acid (DNA) methylation data, respectively, in schizophrenia patients and healthy controls. We treated each gene-derived SNP, region of interest (ROI) and gene-derived DNA methylation as a single testing unit, which are combined into triplets for evaluation. In addition, cardiovascular disease risk factors such as age, gender, and body mass index were assessed as covariates on hippocampal volume and compared between triplets. Our method identified $13$ triplets ($p$-values $\leq 0.001$) that included $6$ gene-derived SNPs, $10$ ROIs, and $6$ gene-derived DNA methylations that correlated with changes in hippocampal volume, suggesting that these triplets may be important in explaining schizophrenia-related neurodegeneration. With strong evidence ($p$-values $\leq 0.000001$), the triplet (MAGI2, CRBLCrus1.L, FBXO28) has the potential to distinguish schizophrenia patients from the healthy control variations.


On Measuring and Quantifying Performance: Error Rates, Surrogate Loss, and an Example in SSL

arXiv.org Machine Learning

The aim of semi-supervised learning is to improve supervised learners by exploiting potentially large amounts of, typically easier to obtain, unlabeled data [1]. Up to now, however, semi-supervised learners have reported mixed results when it comes to such improvements: it is not always the case that semi-supervision results in lower expected error rates. On the contrary, severely deteriorated performances have been observed in empirical studies and theory shows that improvement guarantees can often only be provided under rather stringent conditions [2-5]. Now, the principal suggestion put forward in this chapter is that, when dealing with semi-supervised learning, one may not only want to study the (expected) error rates these classifiers produce, but also to measure the classifiers' performances by means of the intrinsic loss they may be optimizing in the first place. That is, for classification routines that optimize a so-called surrogate loss at training time (which is what many machine learning and Bayesian decision theoretic approaches do [6, 7]), we propose to also investigate how this loss behaves on the test set, as this can provide us with an alternative view on the classifier's behavior that a mere error rate cannot capture. In fact, though the main example is concerned with semi-supervision, we would like to argue that in other learning scenarios, similar considerations might be beneficial. For instance in active learning [8], where rather than sampling randomly from one's input data to provide these instances with labels, one aims to do the sampling in a systematic way, trying to keep labeling cost as low as one can or, similarly, to learn from as few labeled examples as possible. Also here it may (or, we believe, it should) be of interest to not only compare the error rates that different approaches (e.g.


Submodular Variational Inference for Network Reconstruction

arXiv.org Machine Learning

In real-world and online social networks, individuals receive and transmit information in real time. Cascading information transmissions (e.g. phone calls, text messages, social media posts) may be understood as a realization of a diffusion process operating on the network, and its branching path can be represented by a directed tree. The process only traverses and thus reveals a limited portion of the edges. The network reconstruction/inference problem is to infer the unrevealed connections. Most existing approaches derive a likelihood and attempt to find the network topology maximizing the likelihood, a problem that is highly intractable. In this paper, we focus on the network reconstruction problem for a broad class of real-world diffusion processes, exemplified by a network diffusion scheme called respondent-driven sampling (RDS). We prove that under realistic and general models of network diffusion, the posterior distribution of an observed RDS realization is a Bayesian log-submodular model. We then propose VINE (Variational Inference for Network rEconstruction), a novel, accurate, and computationally efficient variational inference algorithm, for the network reconstruction problem under this model. Crucially, we do not assume any particular probabilistic model for the underlying network. VINE recovers any connected graph with high accuracy as shown by our experimental results on real-life networks.


WWE Great Balls Of Fire 2017: Live Stream Info, Start Time, Match Card For 'Monday Night Raw' PPV

International Business Times

Four WWE superstars are unofficially in the running to be in the main event of SummerSlam 2017 on Aug. 20. The biggest match since WrestleMania 33 will be determined Sunday night at Great Balls of Fire 2017, WWE's next "Monday Night Raw" pay-per-view. Great Balls of Fire 2017 is scheduled to start at 8 p.m. EDT, and the pre-show gets underway at 7:30 p.m. EDT. Ordering the event on PPV costs $54.99, but fans can also watch the event with a live stream on the WWE Network. A subscription to the network costs $9.99 per month, though new subscribers get the first month free.