Performance Analysis
The generalised random dot product graph
Rubin-Delanchy, Patrick, Priebe, Carey E., Tang, Minh
Because they appear in virtually every facet of the digital world, there is considerable value in being able to make inference and predictions based on networks. In Statistics, such endeavours often start with a probability model, mapping unknown quantities of interest to the data, and, here, one is proposed which strikes a promising balance of generality and interpretability. Our focus is on the simplest case of modelling a graph, that is, a set of nodes and (undirected) edges. To start discussions, we consider first the benefits and drawbacks of a foundational model known as the stochastic blockmodel (Holland et al., 1983). In this model, the nodes of the graph can be grouped into k communities, such that the probability of two nodes forming an edge is dependent only on the two communities involved, and is given by a k × k inter-community edge probability matrix B. The model can be regarded as providing a piecewise constant, or even histogram (Olhede and Wolfe, 2014), approximation to any random graph model satisfying basic exchangeability assumptions (Aldous, 1981; Hoover, 1979). Its generality yet simple interpretation make it a natural candidate for exploratory data analysis and the model is very popular in practice. However, one obvious issue is its discrete structure, in particular, the 'hard' assignment of every node to a single community. We would often prefer to describe node behaviour in a more continuous way. In a seminal paper, Hoff et al. (2002) considered a number of latent position models where, in abstract terms, each node i is mapped to a point X
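The stochastic blockmodel described in this abstract can be illustrated with a minimal sketch. The function name `sample_sbm`, the two-community matrix `B`, and the community sizes below are assumptions chosen for illustration; they do not come from the paper.

```python
import numpy as np

def sample_sbm(labels, B, rng):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic blockmodel.

    labels[i] is the community of node i; B[a, b] is the probability of an
    edge between a node in community a and a node in community b.
    """
    labels = np.asarray(labels)
    n = labels.size
    # P[i, j] = B[labels[i], labels[j]]: edge probability for every node pair.
    P = B[np.ix_(labels, labels)]
    # Draw each unordered pair once (strict upper triangle), then mirror it,
    # so the graph is undirected and has no self-loops.
    upper = np.triu(rng.random((n, n)) < P, k=1)
    return (upper | upper.T).astype(int)

# Hypothetical example: two communities of 50 nodes each.
rng = np.random.default_rng(0)
B = np.array([[0.8, 0.1],
              [0.1, 0.6]])
labels = np.repeat([0, 1], 50)
A = sample_sbm(labels, B, rng)
```

Every node carries exactly one label here, which is the 'hard' community assignment the abstract identifies as the model's main limitation.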
Evaluating Data Science Projects: A Case Study Critique
I've written two blog posts on evaluation--the broccoli of machine learning. Both types are important not only to data scientists but also to managers and executives, who must evaluate project proposals and results. To managers I would say: It's not necessary to understand the inner workings of a machine learning project, but you should understand whether the right things have been measured and whether the results are suited to the business problem. You need to know whether to believe what data scientists are telling you. To this end, here I'll evaluate a machine learning project report.
Text Compression for Sentiment Analysis via Evolutionary Algorithms
Dufourq, Emmanuel, Bassett, Bruce A.
Can textual data be compressed intelligently without losing accuracy in evaluating sentiment? In this study, we propose a novel evolutionary compression algorithm, PARSEC (PARts-of-Speech for sEntiment Compression), which makes use of Parts-of-Speech tags to compress text in a way that sacrifices minimal classification accuracy when used in conjunction with sentiment analysis algorithms. An analysis of PARSEC with eight commercial and non-commercial sentiment analysis algorithms on twelve English sentiment data sets reveals that accurate compression is possible with (0%, 1.3%, 3.3%) loss in sentiment classification accuracy for (20%, 50%, 75%) data compression with PARSEC using LingPipe, the most accurate of the sentiment algorithms. Other sentiment analysis algorithms are more severely affected by compression. We conclude that significant compression of text data is possible for sentiment analysis depending on the accuracy demands of the specific application and the specific sentiment analysis algorithm used.
WWE No Mercy 2017: Predictions, Match Card For 'Monday Night Raw' PPV
It's hard to remember a non-WrestleMania or SummerSlam pay-per-view that had two bigger matches than the ones headlining WWE No Mercy 2017 Sunday night. The card features Brock Lesnar vs. Braun Strowman and John Cena vs. Roman Reigns, both of which are WrestleMania-worthy matches. Below are predictions for every match on the WWE No Mercy card, which features wrestlers from "Monday Night Raw." It's time to put the strap on Strowman. Sure, he's gotten a big push by WWE, but his rise to the top of the card has also been an organic one. During a year in which every three-hour "Monday Night Raw" hasn't exactly been worth watching, Strowman has consistently been the best part of the show, going from a monster heel into maybe the most popular wrestler on the roster.
Practical Machine Learning Coursera
About this course: Prediction and machine learning are among the most common tasks performed by data scientists and data analysts. This course will cover the basic components of building and applying prediction functions with an emphasis on practical applications. The course will provide basic grounding in concepts such as training and test sets, overfitting, and error rates. The course will also introduce a range of model-based and algorithmic machine learning methods including regression, classification trees, Naive Bayes, and random forests. The course will cover the complete process of building prediction functions including data collection, feature creation, algorithms, and evaluation.
Model-Powered Conditional Independence Test
Sen, Rajat, Suresh, Ananda Theertha, Shanmugam, Karthikeyan, Dimakis, Alexandros G., Shakkottai, Sanjay
We consider the problem of non-parametric Conditional Independence testing (CI testing) for continuous random variables. Given i.i.d. samples from the joint distribution $f(x,y,z)$ of continuous random vectors $X,Y$ and $Z$, we determine whether $X \perp Y | Z$. We approach this by converting the conditional independence test into a classification problem. This allows us to harness very powerful classifiers like gradient-boosted trees and deep neural networks. These models can handle complex probability distributions and allow us to perform significantly better compared to the prior state of the art, for high-dimensional CI testing. The main technical challenge in the classification problem is the need for samples from the conditional product distribution $f^{CI}(x,y,z) = f(x|z)f(y|z)f(z)$ -- the joint distribution if and only if $X \perp Y | Z$ -- when given access only to i.i.d. samples from the true joint distribution $f(x,y,z)$. To tackle this problem we propose a novel nearest neighbor bootstrap procedure and theoretically show that our generated samples are indeed close to $f^{CI}$ in terms of total variational distance. We then develop theoretical results regarding the generalization bounds for classification for our problem, which translate into error bounds for CI testing. We provide a novel analysis of Rademacher type classification bounds in the presence of non-i.i.d. near-independent samples. We empirically validate the performance of our algorithm on simulated and real datasets and show performance gains over previous methods.
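The abstract does not spell out the nearest neighbor bootstrap, so the following is only a sketch of one plausible reading, not necessarily the authors' exact procedure. It assumes the data are split into two halves and that each sample in the first half keeps its $x$ and $z$ but takes its $y$ from the second-half sample whose $z$ is closest. The function name `nn_bootstrap` and the simulated data are hypothetical.

```python
import numpy as np

def nn_bootstrap(x, y, z, rng):
    """Build samples (x_i, y_j, z_i) meant to mimic f(x|z) f(y|z) f(z).

    x, y, z are 2-D arrays with one row per sample. For each sample i in a
    random first half, j is the sample in the second half whose z is nearest
    to z_i in Euclidean distance.
    """
    n = len(z)
    perm = rng.permutation(n)
    first, second = perm[: n // 2], perm[n // 2:]
    # Pairwise squared distances between z rows of the two halves.
    diff = z[first][:, None, :] - z[second][None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    nearest = second[d2.argmin(axis=1)]
    # Swapping in the neighbour's y breaks the direct x-y link while
    # keeping y approximately distributed as f(y | z_i).
    return x[first], y[nearest], z[first]

# Hypothetical data in which Y depends on X directly, so X and Y are
# NOT conditionally independent given Z.
rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=(n, 1))
x = z + 0.1 * rng.normal(size=(n, 1))
y = x + 0.1 * rng.normal(size=(n, 1))
xb, yb, zb = nn_bootstrap(x, y, z, rng)
```

In the classification view described above, the original triples and the bootstrapped triples would be given different labels. A classifier that separates them clearly better than chance is evidence against $X \perp Y | Z$.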
A constrained L1 minimization approach for estimating multiple Sparse Gaussian or Nonparanormal Graphical Models
Wang, Beilun, Singh, Ritambhara, Qi, Yanjun
Identifying context-specific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse Undirected Graphical Models (UGM) from aggregated samples across several contexts. Previous joint-UGM studies have mostly focused on sparse Gaussian Graphical Models (sGGMs) and can't identify context-specific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly) to learn multi-UGM via a constrained L1 minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the L1 constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate Nonparanormal data that greatly relaxes the normality assumption that many real-world applications do not follow. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate O(log(Kp)/n_{tot}). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over state-of-the-art multi-sGGM and single-UGM baselines.
Dealing with unbalanced data in machine learning
In my last post, where I shared the code that I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advised to consider over- or under-sampling when you have unbalanced data sets. Because my focus in this webinar was on evaluating model performance, I did not want to add an additional layer of complexity and therefore did not further discuss how to specifically deal with unbalanced data. But because I had gotten a few questions regarding this, I thought it would be worthwhile to explain over- and under-sampling techniques in more detail and show how you can very easily implement them with caret. In this context, unbalanced data refers to classification problems where we have unequal instances for different classes. Having unbalanced data is actually very common in general, but it is especially prevalent when working with disease data where we usually have more healthy control samples than disease cases.
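The post implements resampling with caret in R. The sketch below is not caret's API; it is a plain NumPy illustration of random over- and under-sampling, with the function name `resample_balanced` and the 95/5 class split assumed for the example.

```python
import numpy as np

def resample_balanced(X, y, rng, mode="over"):
    """Randomly resample rows so that every class has the same count.

    mode="over"  grows each class to the size of the largest class
                 (sampling with replacement);
    mode="under" shrinks each class to the size of the smallest class
                 (sampling without replacement).
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() if mode == "over" else counts.min()
    chosen = []
    for c, count in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # Replacement is needed only when a class has to grow.
        replace = bool(count < target)
        chosen.append(rng.choice(members, size=target, replace=replace))
    idx = rng.permutation(np.concatenate(chosen))
    return X[idx], y[idx]

# Hypothetical unbalanced data: 95 controls (0) and 5 cases (1).
rng = np.random.default_rng(0)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 95 + [1] * 5)
X_over, y_over = resample_balanced(X, y, rng, mode="over")
X_under, y_under = resample_balanced(X, y, rng, mode="under")
```

Resampling of this kind is normally applied to the training data only, so that the test set keeps the original class proportions.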
Dealing with Unbalanced Classes in Machine Learning - deep ideas
In many real-world classification problems, we stumble upon training data with unbalanced classes. This means that the individual classes do not contain the same number of elements. For example, if we want to build an image-based skin cancer detection system using convolutional neural networks, we might encounter a dataset with about 95% negatives and 5% positives. This is for good reasons: Images associated with a negative diagnosis are way more common than images with a positive diagnosis. Rather than regarding this as a flaw in the dataset, we should leverage the additional information that we get.