Performance Analysis
The impossibility of "fairness": a generalized impossibility result for decisions
Various measures can be used to estimate bias or unfairness in a predictor. Previous work has already established that some of these measures are incompatible with each other. Here we show that, when groups differ in prevalence of the predicted event, several intuitive, reasonable measures of fairness (probability of positive prediction given occurrence or non-occurrence; probability of occurrence given prediction or non-prediction; and ratio of predictions over occurrences for each group) are all mutually exclusive: if one of them is equal among groups, the other two must differ. The only exceptions are for perfect, or trivial (always-positive or always-negative) predictors. As a consequence, any non-perfect, non-trivial predictor must necessarily be "unfair" under two out of three reasonable sets of criteria. This result readily generalizes to a wide range of well-known statistical quantities (sensitivity, specificity, false positive rate, precision, etc.), all of which can be divided into three mutually exclusive groups. Importantly, the result applies to all predictors, whether algorithmic or human. We conclude with possible ways to handle this effect when assessing and designing prediction methods.
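A small numeric illustration may make the claim above concrete. The sketch below is not taken from the paper: the prevalences (0.5 and 0.2), the shared sensitivity and specificity (0.8), and the function name rates are all assumptions chosen for the example. It holds the first criterion (prediction given occurrence or non-occurrence) equal across two groups and shows that the other two then differ.

```python
def rates(prevalence, sensitivity, specificity, n=10_000):
    """Build an expected confusion matrix for one group and derive its rates.

    All inputs are illustrative assumptions, not figures from the paper.
    """
    pos = prevalence * n          # individuals for whom the event occurs
    neg = n - pos                 # individuals for whom it does not
    tp = sensitivity * pos        # true positives
    fn = pos - tp                 # false negatives
    tn = specificity * neg        # true negatives
    fp = neg - tn                 # false positives
    return {
        "sensitivity": tp / (tp + fn),     # P(prediction | occurrence)
        "specificity": tn / (tn + fp),     # P(no prediction | non-occurrence)
        "ppv": tp / (tp + fp),             # P(occurrence | prediction)
        "npv": tn / (tn + fn),             # P(non-occurrence | no prediction)
        "pred_over_occ": (tp + fp) / pos,  # predictions per occurrence
    }


# Same predictor quality in both groups, different prevalence.
group_a = rates(prevalence=0.5, sensitivity=0.8, specificity=0.8)
group_b = rates(prevalence=0.2, sensitivity=0.8, specificity=0.8)

for name in group_a:
    print(f"{name:>14}: A={group_a[name]:.3f}  B={group_b[name]:.3f}")
```

With these numbers the sensitivity and specificity rows match, while the precision row and the predictions-over-occurrences row do not, which is the pattern the abstract describes.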
Rates of Convergence of Spectral Methods for Graphon Estimation
This paper studies the problem of estimating the graphon model - the underlying generating mechanism of a network. Graphon estimation arises in many applications such as predicting missing links in networks and learning user preferences in recommender systems. The graphon model deals with a random graph of $n$ vertices such that each pair of vertices $i$ and $j$ is connected independently with probability $\rho \times f(x_i,x_j)$, where $x_i$ is the unknown $d$-dimensional label of vertex $i$, $f$ is an unknown symmetric function, and $\rho$ is a scaling parameter characterizing the graph sparsity. Recent studies have identified the minimax error rate of estimating the graphon from a single realization of the random graph. However, there exists a wide gap between the known error rates of computationally efficient estimation procedures and the minimax optimal error rate. Here we analyze a spectral method, namely the universal singular value thresholding (USVT) algorithm, in the relatively sparse regime with the average vertex degree $n\rho=\Omega(\log n)$. When $f$ belongs to H\"{o}lder or Sobolev space with smoothness index $\alpha$, we show the error rate of USVT is at most $(n\rho)^{ -2 \alpha / (2\alpha+d)}$, approaching the minimax optimal error rate $\log (n\rho)/(n\rho)$ for $d=1$ as $\alpha$ increases. Furthermore, when $f$ is analytic, we show the error rate of USVT is at most $\log^d (n\rho)/(n\rho)$. In the special case of stochastic block model with $k$ blocks, the error rate of USVT is at most $k/(n\rho)$, which is larger than the minimax optimal error rate by at most a multiplicative factor $k/\log k$. This coincides with the computational gap observed for community detection. A key step of our analysis is to derive the eigenvalue decaying rate of the edge probability matrix using piecewise polynomial approximations of the graphon function $f$.
Boosting the accuracy of your Machine Learning models
An easy way of estimating the test error of a bagged model, without the need for cross-validation, is out-of-bag error estimation. The observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can simply predict the response for the ith observation using each of the trees in which that observation was OOB. We average those predicted responses, or take a majority vote, depending on whether the response is quantitative or qualitative. An overall OOB MSE (mean squared error) or classification error rate can then be computed.
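The procedure can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the article's code: the base learner is a one-split regression stump on a single feature (a stand-in for a full tree), and the names fit_stump and oob_mse, the synthetic data, and the choice of 50 trees are all hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on 1-D inputs; returns a predictor."""
    best = None
    for t in sorted(set(xs))[1:]:          # candidate thresholds (both sides non-empty)
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                       # all x identical: predict the mean
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr


def oob_mse(xs, ys, n_trees=50, seed=0):
    """Out-of-bag MSE of a bagged ensemble of stumps."""
    rng = random.Random(seed)
    n = len(xs)
    oob_preds = [[] for _ in range(n)]     # predictions from trees that did not see i
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        tree = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:            # observation i is OOB for this tree
                oob_preds[i].append(tree(xs[i]))
    # Average the OOB predictions per observation (quantitative response).
    scored = [(mean(p), ys[i]) for i, p in enumerate(oob_preds) if p]
    return mean((pred - y) ** 2 for pred, y in scored)


# Synthetic step-function data with a little noise (illustrative only).
data_rng = random.Random(1)
xs = [i / 100 for i in range(100)]
ys = [(1.0 if x >= 0.5 else 0.0) + data_rng.gauss(0, 0.1) for x in xs]
print("OOB MSE:", oob_mse(xs, ys))
```

For a qualitative response, the per-observation average would be replaced by a majority vote over the OOB predictions.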
What causes predictive models to fail - and how to fix it?
Over-fitting. If you perform a regression with 200 predictors (with strong cross-correlations among predictors), use meta regression coefficients: that is, use coefficients of the form f[Corr(Var, Response), a, b, c] where a, b, c are three meta-parameters (e.g. This will reduce your number of parameters from 200 to 3, and eliminate most of the over-fitting. Perform the right type of cross-validation. If your training set has 400,000 observations distributed across 50 clients, and your test data set (used for cross-validation) has 200,000 observations but only 3 clients or 5 days' worth of historical data, then your cross-validation methodology is very flawed. Better, split your cross-validation data set into 5 subsets to compute confidence intervals. Make sure you've eliminated outliers and cleaned your data set.
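One possible reading of the "5 subsets" advice is sketched below: compute the error metric separately on each of five random subsets of the validation data, then form a Student-t confidence interval from the five subset means. The function name, the synthetic errors, and the choice of a 95% t-interval are assumptions made for illustration; the article does not specify how the interval should be built.

```python
import random
from statistics import mean, stdev

N_SUBSETS = 5
T_CRIT_95_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 degrees of freedom


def subset_confidence_interval(errors, seed=0):
    """95% confidence interval for the mean error, from 5 random subsets."""
    rng = random.Random(seed)
    shuffled = list(errors)
    rng.shuffle(shuffled)
    # Deal the shuffled errors into 5 subsets of (nearly) equal size.
    subsets = [shuffled[i::N_SUBSETS] for i in range(N_SUBSETS)]
    subset_means = [mean(s) for s in subsets]
    centre = mean(subset_means)
    half_width = T_CRIT_95_DF4 * stdev(subset_means) / N_SUBSETS ** 0.5
    return centre - half_width, centre + half_width


# Hypothetical per-observation absolute errors of some model.
err_rng = random.Random(42)
errors = [abs(err_rng.gauss(0, 1)) for _ in range(1000)]
low, high = subset_confidence_interval(errors)
print(f"mean error in [{low:.3f}, {high:.3f}] (95% CI)")
```

When the data come from several clients or time periods, the subsets would arguably be better formed by client or by period than at random, so that each subset reflects the mix the model will face.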
14 Great Articles About Cross-Validation, Model Fitting and Selection
Cross-validation is a technique used to assess the accuracy of a predictive model, based on training set data. It splits the training set into test and control sets. The test sets are used to fine-tune the model to increase performance (better classification rate or reduced errors in prediction) and the control sets are used to simulate how the model would perform outside the training set. The control and test sets must be carefully chosen for this method to make sense.
Reservoir of Diverse Adaptive Learners and Stacking Fast Hoeffding Drift Detection Methods for Evolving Data Streams
Pesaranghader, Ali, Viktor, Herna, Paquet, Eric
The last decade has seen a surge of interest in adaptive learning algorithms for data stream classification, with applications ranging from predicting ozone level peaks, learning stock market indicators, to detecting computer security violations. In addition, a number of methods have been developed to detect concept drifts in these streams. Consider a scenario where we have a number of classifiers with diverse learning styles and different drift detectors. Intuitively, the current 'best' (classifier, detector) pair is application dependent and may change as a result of the stream evolution. Our research builds on this observation. We introduce the $\mbox{Tornado}$ framework that implements a reservoir of diverse classifiers, together with a variety of drift detection algorithms. In our framework, all (classifier, detector) pairs proceed, in parallel, to construct models against the evolving data streams. At any point in time, we select the pair which currently yields the best performance. We further incorporate two novel stacking-based drift detection methods, namely the $\mbox{FHDDMS}$ and $\mbox{FHDDMS}_{add}$ approaches. The experimental evaluation confirms that the current 'best' (classifier, detector) pair is not only heavily dependent on the characteristics of the stream, but also that this selection evolves as the stream flows. Further, our $\mbox{FHDDMS}$ variants detect concept drifts accurately in a timely fashion while outperforming the state-of-the-art.
Feature selection in high-dimensional dataset using MapReduce
Reggiani, Claudio, Borgne, Yann-Aël Le, Bontempi, Gianluca
The exponential growth of data generation, measurement and collection in scientific and engineering disciplines leads to the availability of huge and high-dimensional datasets, in domains as varied as text mining, social networks, astronomy or bioinformatics, to name a few. The only viable path to the analysis of such datasets is to rely on data-intensive distributed computing frameworks [1]. MapReduce has in the last decade established itself as a reference programming model for distributed computing. The model is articulated around two main classes of functions, mappers and reducers, which greatly decrease the complexity of a distributed program while allowing a wide range of computing tasks to be expressed. MapReduce was popularised by Google research in 2008 [2], and may be executed on parallel computing platforms ranging from specialised hardware units such as parallel field programmable gate arrays (FPGAs) and graphics processing units, to large clusters of commodity machines using for example the Hadoop or Spark frameworks [2]-[4]. In particular, the expressiveness of the MapReduce programming model has led to the design of advanced distributed data processing libraries for machine learning and data mining, such as Hadoop Mahout and Spark MLlib. Many of the standard supervised and unsupervised learning techniques (linear and logistic regression, naive Bayes, SVM, random forest, PCA) are now available from these libraries [5]-[7]. Little attention has however yet been given to feature selection algorithms (FSA), which form an essential component of machine learning and data mining workflows. Besides reducing the size of a dataset, FSA also generally improve the performance of classification and regression models by selecting the most relevant features and reducing the noise in a dataset [8].
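To make the mapper/reducer split concrete, here is a single-process sketch in plain Python. It is not the authors' algorithm: it ranks features by variance, a deliberately simple relevance score chosen for illustration, and the names mapper, reducer and map_reduce are hypothetical. In a real Hadoop or Spark job the grouping step would be performed by the framework across machines.

```python
from collections import defaultdict


def mapper(record):
    """Emit (feature_index, (count, sum, sum of squares)) for one record."""
    for j, value in enumerate(record):
        yield j, (1, value, value * value)


def reducer(key, values):
    """Combine the partial statistics of one feature into its variance."""
    n = sum(v[0] for v in values)
    s = sum(v[1] for v in values)
    ss = sum(v[2] for v in values)
    return key, ss / n - (s / n) ** 2


def map_reduce(records):
    """Run map, group by key (the "shuffle"), then reduce, all in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())


# Three records with three features each (illustrative data).
data = [(1.0, 5.0, 0.0), (2.0, 5.0, 10.0), (3.0, 5.0, 20.0)]
variances = map_reduce(data)
ranked = sorted(variances, key=variances.get, reverse=True)
print("features ranked by variance:", ranked)
```

Because each mapper call sees only one record and each reducer call sees only one key, both stages can be spread over many workers without changing the code of either function.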
Visualizing Cross-validation Code
Let's visualize to improve your prediction... Let us say you are writing nice and clean Machine Learning code (e.g. Your code is OK: first you divided your dataset into two parts, "Training Set and Testing Set", as usual, with a function like train_test_split and some random factor. Your prediction could be slightly under- or overfit, like the figures below. As the name suggests, cross-validation is the next fun thing after learning Linear Regression because it helps to improve your prediction using the K-Fold strategy. What is K-Fold, you ask?
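As a rough answer to that question, the sketch below shows the bookkeeping behind K-Fold: shuffle the sample indices, cut them into K folds, and let each fold take one turn as the test set while the rest is used for training. It is written in plain Python for illustration; the function name k_fold_indices is hypothetical and this is not the code from the original post.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


for fold_number, (train, test) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold_number}: train={sorted(train)} test={sorted(test)}")
```

Every sample lands in the test set exactly once, so the K scores together use all of the data for evaluation, unlike a single train/test split.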
Dealing With Imbalanced Datasets
Summary: Dealing with imbalanced datasets is an everyday problem. SMOTE (Synthetic Minority Oversampling TEchnique) and its variants are oversampling techniques for solving this problem that have recently become a very popular way to improve model performance. There are some problems that never go away. An imbalanced dataset is one in which the majority case greatly outweighs the minority case. Years ago we dealt with this by naïve oversampling or, if we had enough data, even undersampling to get the dataset more in balance.
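The core idea of SMOTE, creating a synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration in plain Python, not a faithful reimplementation of the published algorithm or of any library; the function name smote_like, the value k=3, and the toy data are assumptions.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()   # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Four minority-class points in two dimensions (illustrative data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Naïve oversampling, by contrast, would simply repeat existing minority rows, which adds no new information and can encourage over-fitting to those exact points.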
Blockchains for Artificial Intelligence – The BigchainDB Blog
[It was first published on Dataconomy on Dec 21, 2016; I'm reposting here for ease of access. In May 2017 I gave an updated talk; here are the slides & video.] In recent years, AI (artificial intelligence) researchers have finally cracked problems that they've worked on for decades, from Go to human-level speech recognition. A key piece was the ability to gather and learn on mountains of data, which pulled error rates past the success line. In short, big data has transformed AI, to an almost unreasonable level. Blockchain technology could transform AI too, in its own particular ways. Some applications of blockchains to AI are mundane, like audit trails on AI models. Some appear almost unreasonable, like AI that can own itself -- AI DAOs. All of them are opportunities. This article will explore these applications. Before we discuss applications, let's first review what's different about blockchains compared to traditional big-data distributed databases like MongoDB.