Accuracy
Surrogate Aided Unsupervised Recovery of Sparse Signals in Single Index Models for Binary Outcomes
Chakrabortty, Abhishek, Neykov, Matey, Carroll, Raymond, Cai, Tianxi
We consider the recovery of regression coefficients, denoted by $\boldsymbol{\beta}_0$, for a single index model (SIM) relating a binary outcome $Y$ to a set of possibly high dimensional covariates $\boldsymbol{X}$, based on a large but 'unlabeled' dataset $\mathcal{U}$, with $Y$ never observed. On $\mathcal{U}$, we fully observe $\boldsymbol{X}$ and additionally, a surrogate $S$ which, while not being strongly predictive of $Y$ throughout the entirety of its support, can forecast it with high accuracy when it assumes extreme values. Such datasets arise naturally in modern studies involving large databases such as electronic medical records (EMR) where $Y$, unlike $(\boldsymbol{X}, S)$, is difficult and/or expensive to obtain. In EMR studies, an example of $Y$ and $S$ would be the true disease phenotype and the count of the associated diagnostic codes respectively. Assuming another SIM for $S$ given $\boldsymbol{X}$, we show that under sparsity assumptions, we can recover $\boldsymbol{\beta}_0$ proportionally by simply fitting a least squares LASSO estimator to the subset of the observed data on $(\boldsymbol{X}, S)$ restricted to the extreme sets of $S$, with $Y$ imputed using the surrogacy of $S$. We obtain sharp finite sample performance bounds for our estimator, including deterministic deviation bounds and probabilistic guarantees. We demonstrate the effectiveness of our approach through multiple simulation studies, as well as by application to real data from an EMR study conducted at the Partners HealthCare Systems.
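The selection-and-imputation step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the cutoff values, and the convention that large values of S indicate Y = 1 (as with a count of diagnostic codes) are assumptions made here. The LASSO fit on the retained rows is not shown.

```python
def impute_from_extremes(surrogate, lower_cut, upper_cut):
    """Keep only observations with an extreme surrogate value and impute the outcome.

    Returns a list of (index, imputed_y) pairs: imputed_y is 0 when S <= lower_cut
    and 1 when S >= upper_cut. Observations in between are dropped.
    The cutoffs are hypothetical tuning choices, not values from the paper.
    """
    if lower_cut >= upper_cut:
        raise ValueError("lower_cut must be strictly below upper_cut")
    selected = []
    for i, s in enumerate(surrogate):
        if s <= lower_cut:
            selected.append((i, 0))   # extreme low surrogate: impute Y = 0
        elif s >= upper_cut:
            selected.append((i, 1))   # extreme high surrogate: impute Y = 1
    return selected


if __name__ == "__main__":
    # Hypothetical diagnostic-code counts for six patients.
    counts = [0, 1, 5, 12, 0, 9]
    print(impute_from_extremes(counts, lower_cut=0, upper_cut=9))
```

The rows of X at the returned indices, paired with the imputed outcomes, would then be passed to a least squares LASSO solver.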
Statistical Latent Space Approach for Mixed Data Modelling and Applications
Nguyen, Tu Dinh, Tran, Truyen, Phung, Dinh, Venkatesh, Svetha
The analysis of mixed data poses challenges in statistics and machine learning. One of the two most prominent challenges is to develop new statistical techniques and methodologies that handle mixed data effectively by making the data less heterogeneous with minimal loss of information. The other is that such methods must scale to large tasks involving huge amounts of mixed data. To tackle these challenges, we introduce parameter sharing and balancing extensions to our recent model, the mixed-variate restricted Boltzmann machine (MV.RBM), which can transform heterogeneous data into a homogeneous representation. We also integrate structured sparsity and distance metric learning into RBM-based models. Our proposed methods are applied in various applications, including latent patient profile modelling in medical data analysis and representation learning for image retrieval. The experimental results demonstrate that the models perform better than baseline methods on medical data and outperform state-of-the-art rivals on an image dataset.
Statistical Anomaly Detection via Composite Hypothesis Testing for Markov Models
Zhang, Jing, Paschalidis, Ioannis Ch.
Under Markovian assumptions, we leverage a Central Limit Theorem (CLT) for the empirical measure in the test statistic of the composite hypothesis Hoeffding test so as to establish weak convergence results for the test statistic and thereby derive a new estimator for the threshold needed by the test. We first show the advantages of our estimator over an existing estimator by conducting extensive numerical experiments. We find that our estimator controls false alarms better while maintaining satisfactory detection probabilities. We then apply the Hoeffding test with our threshold estimator to detecting anomalies in two distinct application domains: one in communication networks and the other in transportation networks. The former application seeks to enhance cyber security and the latter aims at building smarter transportation systems in cities.
Improving your statistical inferences Coursera
About this course: This course aims to help you draw better statistical inferences from empirical research. First, we will discuss how to correctly interpret p-values, effect sizes, confidence intervals, Bayes Factors, and likelihood ratios, and how these statistics answer different questions you might be interested in. Then, you will learn how to design experiments where the false positive rate is controlled, and how to decide upon the sample size for your study, for example in order to achieve high statistical power. Subsequently, you will learn how to interpret evidence in the scientific literature given widespread publication bias, for example by learning about p-curve analysis. Finally, we will talk about how to do philosophy of science, theory construction, and cumulative science, including how to perform replication studies, why and how to pre-register your experiment, and how to share your results following Open Science principles. In practical, hands-on assignments, you will learn how to simulate t-tests to learn which p-values you can expect, calculate likelihood ratios and get an introduction to binomial Bayesian statistics, and learn about the positive predictive value, which expresses the probability that published research findings are true.
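As an illustration of the positive predictive value mentioned in the course description, the sketch below uses the standard formula PPV = power × prior / (power × prior + alpha × (1 − prior)). The function name and the example numbers are assumptions made here, not material from the course, and the formula ignores publication bias and other distortions.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """Probability that a significant finding reflects a true effect.

    prior: assumed proportion of tested hypotheses that are true
    power: probability of a significant result when the effect is real
    alpha: probability of a significant result when there is no effect
    """
    for name, value in (("prior", prior), ("power", power), ("alpha", alpha)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_positives = power * prior            # share of all tests: real and detected
    false_positives = alpha * (1.0 - prior)   # share of all tests: null but significant
    if true_positives + false_positives == 0.0:
        raise ValueError("no significant results are possible with these inputs")
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenario: half of the tested hypotheses are true,
    # 80% power, 5% significance level.
    print(round(positive_predictive_value(0.5, 0.8, 0.05), 3))
```

Lowering the prior while holding power and alpha fixed lowers the PPV, which is why the same significance threshold can be more or less trustworthy depending on how plausible the tested hypotheses were to begin with.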
WWE SummerSlam 2017: Predictions, Match Card For Biggest PPV Since WrestleMania
WWE's second-biggest show of the year is set for Sunday night at Barclays Center with SummerSlam 2017. The build towards the pay-per-view has been a disappointing one, though there's plenty of hype surrounding the main event. Brock Lesnar will defend his WWE Universal Championship in what's sure to be the night's final match, and the Intercontinental Championship is the only belt that isn't scheduled to be on the line in Brooklyn. Below are predictions for every match on the SummerSlam card, which features wrestlers from both "Monday Night Raw" and "SmackDown Live."
Optimal Alarms for Vehicular Collision Detection
Motro, Michael, Ghosh, Joydeep, Bhat, Chandra
Recent advances in in-vehicle awareness have end uses such as messages or warnings to drivers, automated braking or control, or fully driverless vehicles. There are similarly many sensors and communication devices that can provide awareness, and many models of traffic motion or human action that add predictive power. As there are many possible approaches, a single unified framework for intelligent vehicle design seems unlikely in the near future. However, there are certain tasks that are important for a variety of intelligent vehicle applications and (relatively) independent of the individual sensors or models used. One such task is vehicular collision detection: given the current position and state of two or more vehicles and a predictive model for their future motion, determine whether there is a significant chance of collision between vehicles in the near future. This task may sound trivial and is indeed simpler than the problems of scene reconstruction, predictive modeling or path planning. This simplicity allows vehicular collision detection to be framed as a self-contained task, with solutions that compromise between speed and robustness. Collision detection closely matches the theoretical problem of optimal alarm design [1], [2]. Optimal alarms were initially studied in the context of detecting bankruptcies or machine part failures [3] - critical events that should be detected in advance with high probability, much like collisions.
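The collision detection task described above (current vehicle states plus a predictive motion model, giving a chance of collision) can be sketched with a simple Monte Carlo estimate. This is not the optimal alarm design the paper refers to: the constant-velocity model with Gaussian velocity noise, the circular safety radius, the alarm threshold, and every name and default value below are assumptions made for illustration.

```python
import math
import random


def collision_probability(pos_a, vel_a, pos_b, vel_b, *, radius=2.0, horizon=3.0,
                          dt=0.1, vel_noise=0.5, samples=2000, seed=0):
    """Monte Carlo estimate of P(vehicles come within `radius` metres inside `horizon` seconds).

    Hypothetical motion model: each vehicle keeps a constant velocity, with
    independent Gaussian noise (std `vel_noise`, m/s) on each velocity component.
    """
    rng = random.Random(seed)
    steps = int(round(horizon / dt))
    hits = 0
    for _ in range(samples):
        # Sample one possible future velocity for each vehicle.
        va = (rng.gauss(vel_a[0], vel_noise), rng.gauss(vel_a[1], vel_noise))
        vb = (rng.gauss(vel_b[0], vel_noise), rng.gauss(vel_b[1], vel_noise))
        for k in range(steps + 1):
            t = k * dt
            dx = (pos_a[0] + va[0] * t) - (pos_b[0] + vb[0] * t)
            dy = (pos_a[1] + va[1] * t) - (pos_b[1] + vb[1] * t)
            if math.hypot(dx, dy) <= radius:
                hits += 1   # this sampled future contains a near-collision
                break
    return hits / samples


def should_alarm(probability, threshold=0.2):
    """Raise an alarm when the estimated collision probability exceeds a threshold."""
    return probability > threshold


if __name__ == "__main__":
    # Two vehicles approaching head-on along the x-axis, 40 m apart.
    p = collision_probability((0.0, 0.0), (10.0, 0.0), (40.0, 0.0), (-10.0, 0.0))
    print(p, should_alarm(p))
```

The time step and sample count control the trade-off between speed and robustness that the abstract mentions: coarser steps and fewer samples run faster but can miss brief close approaches.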
Decision Trees and Random Forests for Classification and Regression pt.1
Want to use something more interpretable, something that trains faster and performs pretty much just as well as the old Logistic Regression or even Neural Networks? You should consider Decision Trees for classification and regression. Decision Trees and their extension Random Forests are robust and easy-to-interpret machine learning algorithms for Classification and Regression tasks. Decision Trees and Decision Tree Learning together comprise a simple and fast way of learning a function that maps data x to outputs y, where x can be a mix of categorical and numeric variables and y can be categorical for classification, or numeric for regression. Methods such as SVMs, Logistic Regression and Deep Neural Nets pretty much do the same thing.
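To make the idea of learning a function from x to y concrete, here is a minimal sketch of the smallest possible decision tree: a single split on one numeric feature, chosen by Gini impurity. The function names and the toy data are assumptions made here, not code from the article, and a real tree would apply this search recursively over many features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Find the threshold on one numeric feature that minimises weighted Gini impurity.

    Returns (threshold, impurity); threshold is None when no split improves on
    the unsplit node.
    """
    values = sorted(set(xs))
    best_threshold, best_impurity = None, gini(ys)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


def predict(x, threshold, left_label, right_label):
    """Classify one value with a single-split tree (a 'decision stump')."""
    return left_label if x <= threshold else right_label


if __name__ == "__main__":
    xs = [1, 2, 3, 10, 11, 12]
    ys = ["a", "a", "a", "b", "b", "b"]
    threshold, impurity = best_split(xs, ys)
    print(threshold, impurity, predict(4, threshold, "a", "b"))
```

The learned rule reads directly as "if x <= threshold then a, else b", which is the interpretability the article is pointing at.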
Scalable Joint Models for Reliable Uncertainty-Aware Event Prediction
Soleimani, Hossein, Hensman, James, Saria, Suchi
Missing data and noisy observations pose significant challenges for reliably predicting events from irregularly sampled multivariate time series (longitudinal) data. Imputation methods, which are typically used for completing the data prior to event prediction, lack a principled mechanism to account for the uncertainty due to missingness. Alternatively, state-of-the-art joint modeling techniques can be used to jointly model the longitudinal and event data and compute event probabilities conditioned on the longitudinal observations. These approaches, however, make strong parametric assumptions and do not easily scale to multivariate signals with many observations. Our proposed approach consists of several key innovations. First, we develop a flexible and scalable joint model based upon sparse multiple-output Gaussian processes. Unlike state-of-the-art joint models, the proposed model can explain highly challenging structure including non-Gaussian noise while scaling to large data. Second, we derive an optimal policy for predicting events using the distribution of the event occurrence estimated by the joint model. The derived policy trades off the cost of a delayed detection versus incorrect assessments and abstains from making decisions when the estimated event probability does not satisfy the derived confidence criteria. Experiments on a large dataset show that the proposed framework significantly outperforms state-of-the-art techniques in event prediction.
Learning to Plan Chemical Syntheses
Segler, Marwin H. S., Preuss, Mike, Waller, Mark P.
From medicines to materials, small organic molecules are indispensable for human well-being. To plan their syntheses, chemists employ a problem-solving technique called retrosynthesis. In retrosynthesis, target molecules are recursively transformed into increasingly simpler precursor compounds until a set of readily available starting materials is obtained. Computer-aided retrosynthesis would be a highly valuable tool; however, past approaches were slow and provided results of unsatisfactory quality. Here, we employ Monte Carlo Tree Search (MCTS) to efficiently discover retrosynthetic routes. MCTS was combined with an expansion policy network that guides the search, and an "in-scope" filter network to pre-select the most promising retrosynthetic steps. These deep neural networks were trained on 12 million reactions, which represents essentially all reactions ever published in organic chemistry. Our system solves almost twice as many molecules and is 30 times faster in comparison to the traditional search method based on extracted rules and hand-coded heuristics. Finally, after a 60-year history of computer-aided synthesis planning, chemists can no longer distinguish between routes generated by a computer system and real routes taken from the scientific literature. We anticipate that our method will accelerate drug and materials discovery by assisting chemists to plan better syntheses faster, and by enabling fully automated robot synthesis.
Collaborative Filtering using Denoising Auto-Encoders for Market Basket Data
Abad, Andres G., Reyes-Castro, Luis I.
Recommender systems (RS) help users navigate large sets of items in the search for "interesting" ones. One approach to RS is Collaborative Filtering (CF), which is based on the idea that similar users are interested in similar items. Most model-based approaches to CF seek to train a machine-learning/data-mining model based on sparse data; the model is then used to provide recommendations. While most of the proposed approaches are effective for small-size situations, the combinatorial nature of the problem makes them impractical for medium-to-large instances. In this work we present a novel approach to CF that works by training a Denoising Auto-Encoder (DAE) on corrupted baskets, i.e., baskets from which one or more items have been removed. The DAE is then forced to learn to reconstruct the original basket given its corrupted input. Due to recent advancements in optimization and other technologies for training neural-network models (such as DAE), the proposed method results in a scalable and practical approach to CF. The contribution of this work is twofold: (1) to identify missing items in observed baskets, thus directly providing a CF model; and (2) to construct a generative model of baskets which may be used, for instance, in simulation analysis or as part of a more complex analytical method.
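The corruption step described above (removing one or more items from a basket, then asking the model to reconstruct the original) can be sketched as a data-preparation routine. The function names, the binary encoding over a fixed catalog, and the toy baskets are assumptions made here for illustration; the auto-encoder itself and its training loop are not shown.

```python
import random


def corrupt_basket(basket, n_remove=1, rng=None):
    """Return a copy of `basket` with `n_remove` randomly chosen items removed."""
    rng = rng or random.Random()
    items = sorted(basket)   # fixed order so a seeded rng gives repeatable results
    if not 0 < n_remove < len(items):
        raise ValueError("n_remove must leave at least one item in the basket")
    removed = set(rng.sample(items, n_remove))
    return set(items) - removed


def to_binary_vector(basket, catalog):
    """Encode a basket as a 0/1 vector over an ordered item catalog."""
    return [1 if item in basket else 0 for item in catalog]


def training_pairs(baskets, catalog, n_remove=1, seed=0):
    """Build (corrupted input, original target) vector pairs for a denoising auto-encoder."""
    rng = random.Random(seed)
    pairs = []
    for basket in baskets:
        corrupted = corrupt_basket(basket, n_remove, rng)
        pairs.append((to_binary_vector(corrupted, catalog),
                      to_binary_vector(basket, catalog)))
    return pairs


if __name__ == "__main__":
    catalog = ["bread", "eggs", "milk", "tea"]   # hypothetical item catalog
    baskets = [{"bread", "milk", "eggs"}, {"tea", "milk"}]
    for corrupted, original in training_pairs(baskets, catalog):
        print(corrupted, "->", original)
```

At recommendation time, a trained model would be given an observed basket as input, and items scored highly in the reconstruction but absent from the input would be the candidates to recommend.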