Performance Analysis
AI uses bitcoin trail to find and help sex-trafficking victims
After Kubiiki Pride's 13-year-old daughter disappeared, it took 270 days for her mother to find her. When she did, it was as an escort available to be rented out on an online classified web site. Her daughter had been drugged and beaten into compliance by a sex trafficker. To find her, Pride had to trawl through hundreds of advertisements on Backpage.com. When it comes to identifying signs of human trafficking in online sex adverts, the task for police is often no easier.
An Ensemble Classifier for Predicting the Onset of Type II Diabetes
Semerdjian, John, Frank, Spencer
Prediction of disease onset from patient survey and lifestyle data is quickly becoming an important tool for diagnosing a disease before it progresses. In this study, data from the National Health and Nutrition Examination Survey (NHANES) questionnaire is used to predict the onset of type II diabetes. An ensemble model using the output of five classification algorithms was developed to predict the onset of diabetes based on 16 features. The ensemble model had an AUC of 0.834, indicating high performance.
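To make the AUC figure concrete, here is a minimal sketch of how several classifiers' scores could be combined and scored. The abstract does not say how the five outputs are combined, so simple averaging (soft voting) is an assumption, and the function names `average_scores` and `auc` are hypothetical.

```python
from typing import List, Sequence


def average_scores(model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sample scores across models (soft voting, an assumed scheme)."""
    n_models = len(model_scores)
    # zip(*...) groups the scores that different models gave the same sample.
    return [sum(per_sample) / n_models for per_sample in zip(*model_scores)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via its rank interpretation.

    AUC equals the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one,
    counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.834 should be read.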
Classification via Tensor Decompositions of Echo State Networks
This work introduces a tensor-based method to perform supervised classification on spatiotemporal data processed in an echo state network. Typically when performing supervised classification tasks on data processed in an echo state network, the entire collection of hidden layer node states from the training dataset is shaped into a matrix, allowing one to use standard linear algebra techniques to train the output layer. However, the collection of hidden layer states is multidimensional in nature, and representing it as a matrix may lead to undesirable numerical conditions or loss of spatial and temporal correlations in the data. This work proposes a tensor-based supervised classification method on echo state network data that preserves and exploits the multidimensional nature of the hidden layer states. The method, which is based on orthogonal Tucker decompositions of tensors, is compared with the standard linear output weight approach in several numerical experiments on both synthetic and natural data. The results show that the tensor-based approach tends to outperform the standard approach in terms of classification accuracy.
Microsoft's AI is getting crazily good at speech recognition
Microsoft (probably) knows what you're saying. Microsoft's speech recognition efforts have hit a significant milestone. It can now transcribe human speech with a 5.1% error rate, Microsoft technical fellow Xuedong Huang wrote in a blog post -- the same error rate as humans. Microsoft actually thought it hit this point last year, when it reached 5.9%, the word error rate it had measured for humans. But then other researchers carried out separate studies and pegged the human error level as slightly lower, 5.1%.
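For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The sketch below illustrates that standard definition; it is not Microsoft's evaluation code, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 5.1% rate means roughly one word-level error for every 20 reference words.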
WWE SummerSlam 2017: Betting Odds, Start Time, Live Stream Info For PPV
The build towards WWE SummerSlam 2017 hasn't been as entertaining as it's been in recent years, but you might not know that by looking at the card. The pay-per-view is almost being treated like WrestleMania, considering it features 12 advertised matches and could last longer than five hours. The kickoff show starts at 6 p.m. EDT, and the actual pay-per-view gets underway an hour later at 7 p.m. EDT at Barclays Center in Brooklyn. Fans can either watch SummerSlam with a live stream on the WWE Network, or they can order the PPV for $54.99. A subscription to the network costs $9.99 per month, though new subscribers get the first month free.
Surrogate Aided Unsupervised Recovery of Sparse Signals in Single Index Models for Binary Outcomes
Chakrabortty, Abhishek, Neykov, Matey, Carroll, Raymond, Cai, Tianxi
We consider the recovery of regression coefficients, denoted by $\boldsymbol{\beta}_0$, for a single index model (SIM) relating a binary outcome $Y$ to a set of possibly high dimensional covariates $\boldsymbol{X}$, based on a large but 'unlabeled' dataset $\mathcal{U}$, with $Y$ never observed. On $\mathcal{U}$, we fully observe $\boldsymbol{X}$ and additionally, a surrogate $S$ which, while not being strongly predictive of $Y$ throughout the entirety of its support, can forecast it with high accuracy when it assumes extreme values. Such datasets arise naturally in modern studies involving large databases such as electronic medical records (EMR) where $Y$, unlike $(\boldsymbol{X}, S)$, is difficult and/or expensive to obtain. In EMR studies, an example of $Y$ and $S$ would be the true disease phenotype and the count of the associated diagnostic codes respectively. Assuming another SIM for $S$ given $\boldsymbol{X}$, we show that under sparsity assumptions, we can recover $\boldsymbol{\beta}_0$ proportionally by simply fitting a least squares LASSO estimator to the subset of the observed data on $(\boldsymbol{X}, S)$ restricted to the extreme sets of $S$, with $Y$ imputed using the surrogacy of $S$. We obtain sharp finite sample performance bounds for our estimator, including deterministic deviation bounds and probabilistic guarantees. We demonstrate the effectiveness of our approach through multiple simulation studies, as well as by application to real data from an EMR study conducted at the Partners HealthCare Systems.
Statistical Latent Space Approach for Mixed Data Modelling and Applications
Nguyen, Tu Dinh, Tran, Truyen, Phung, Dinh, Venkatesh, Svetha
The analysis of mixed data has been raising challenges in statistics and machine learning. One of the two most prominent challenges is to develop new statistical techniques and methodologies that handle mixed data effectively by making the data less heterogeneous with minimum loss of information. The other challenge is that such methods must scale to large tasks involving huge amounts of mixed data. To tackle these challenges, we introduce parameter sharing and balancing extensions to our recent model, the mixed-variate restricted Boltzmann machine (MV.RBM), which can transform heterogeneous data into a homogeneous representation. We also integrate structured sparsity and distance metric learning into RBM-based models. Our proposed methods are applied in various applications, including latent patient profile modelling in medical data analysis and representation learning for image retrieval. The experimental results demonstrate that the models perform better than baseline methods on medical data and outperform state-of-the-art rivals on an image dataset.
Statistical Anomaly Detection via Composite Hypothesis Testing for Markov Models
Zhang, Jing, Paschalidis, Ioannis Ch.
Under Markovian assumptions, we leverage a Central Limit Theorem (CLT) for the empirical measure in the test statistic of the composite hypothesis Hoeffding test so as to establish weak convergence results for the test statistic, and, thereby, derive a new estimator for the threshold needed by the test. We first show the advantages of our estimator over an existing estimator by conducting extensive numerical experiments. We find that our estimator better controls false alarms while maintaining satisfactory detection probabilities. We then apply the Hoeffding test with our threshold estimator to detecting anomalies in two distinct application domains: one in communication networks and the other in transportation networks. The former application seeks to enhance cyber security and the latter aims at building smarter transportation systems in cities.
Improving your statistical inferences Coursera
About this course: This course aims to help you to draw better statistical inferences from empirical research. First, we will discuss how to correctly interpret p-values, effect sizes, confidence intervals, Bayes Factors, and likelihood ratios, and how these statistics answer different questions you might be interested in. Then, you will learn how to design experiments where the false positive rate is controlled, and how to decide upon the sample size for your study, for example in order to achieve high statistical power. Subsequently, you will learn how to interpret evidence in the scientific literature given widespread publication bias, for example by learning about p-curve analysis. Finally, we will talk about how to do philosophy of science, theory construction, and cumulative science, including how to perform replication studies, why and how to pre-register your experiment, and how to share your results following Open Science principles. In practical, hands-on assignments, you will learn how to simulate t-tests to learn which p-values you can expect, calculate likelihood ratios and get an introduction to binomial Bayesian statistics, and learn about the positive predictive value, which expresses the probability that published research findings are true.
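As an illustration of the positive predictive value mentioned above, the sketch below uses the common textbook form: the share of significant results that reflect true effects, given the significance level, the statistical power, and the prior probability that a tested hypothesis is true. It ignores bias and selective reporting, it is not taken from the course materials, and the function name and parameter names are hypothetical.

```python
def positive_predictive_value(alpha: float, power: float, prior: float) -> float:
    """Probability that a statistically significant finding is a true effect.

    alpha: false positive rate of the test (e.g. 0.05)
    power: probability of detecting a true effect (e.g. 0.80)
    prior: assumed fraction of tested hypotheses that are actually true
    """
    for name, value in (("alpha", alpha), ("power", power), ("prior", prior)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    if true_positives + false_positives == 0.0:
        raise ValueError("no significant results are possible with these inputs")
    return true_positives / (true_positives + false_positives)
```

With `alpha=0.05`, `power=0.8`, and `prior=0.5`, about 94% of significant findings would be true; lowering the prior to 0.1 with the same test drops that to 64%, which is why the share of true hypotheses in a field matters as much as the significance threshold.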