Accuracy
Machine Learning: You Gotta Tame the Beast Before You Let It Out of Its Cage
Machine learning is a fashionable buzzword right now in infosec, and is often referenced as the key to next-gen, signature-less security. But along with all of the hype and buzz, there also is a mind-blowing amount of misunderstanding surrounding machine learning in infosec. Machine learning isn't a silver bullet for all information security problems, and in fact can be detrimental if misinterpreted. For example, company X claims to block 99% of all malware, or company Y's intrusion detection will stop 99% of all attacks, yet customers see an overwhelming number of false positives. What do the accuracy numbers really mean? In fact, these simple statistics lose meaning without the proper context.
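To make the point about context concrete, here is a minimal sketch of how a "99%" claim can coexist with a flood of false positives. The rates below (99% detection rate, 1% false positive rate, 0.1% of scanned files actually malicious) are hypothetical figures chosen for illustration, not numbers from the article, and the function name is likewise made up.

```python
def alert_precision(detection_rate, false_positive_rate, prevalence):
    """Fraction of raised alerts that are real malware (Bayes' rule)."""
    true_alerts = detection_rate * prevalence
    false_alerts = false_positive_rate * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical scenario: the product catches 99% of malware, wrongly flags
# 1% of benign files, and only 1 file in 1,000 is actually malicious.
precision = alert_precision(detection_rate=0.99,
                            false_positive_rate=0.01,
                            prevalence=0.001)
print(f"Share of alerts that are real malware: {precision:.1%}")
```

Under these assumed rates, only about one alert in eleven points at real malware, even though the headline detection figure is 99%.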
Lightweight Random Indexing for Polylingual Text Classification
Moreo Fernández, Alejandro, Esuli, Andrea, Sebastiani, Fabrizio
Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| − 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing, LRI). By running experiments on two well-known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
Post Selection Inference with Kernels
Yamada, Makoto, Umezu, Yuta, Fukumizu, Kenji, Takeuchi, Ichiro
We propose a novel kernel-based post selection inference (PSI) algorithm, which can handle not only non-linearity in data but also structured output such as multi-dimensional and multi-label outputs. Specifically, we develop a PSI algorithm for independence measures, and propose the Hilbert-Schmidt Independence Criterion (HSIC) based PSI algorithm (hsicInf). The novelty of the proposed algorithm is that it can handle non-linearity and/or structured data through kernels. Namely, the proposed algorithm can be used for a wider range of applications, including nonlinear multi-class classification and multivariate regression, which existing PSI algorithms cannot handle. Through synthetic experiments, we show that the proposed approach can find a set of statistically significant features for both regression and classification problems. Moreover, we apply the hsicInf algorithm to real-world data, and show that hsicInf can successfully identify important features.
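For readers unfamiliar with HSIC, the following is a minimal sketch of one common biased empirical estimate, trace(KHLH) / n^2, where K and L are Gram matrices and H is the centering matrix. It is not the hsicInf algorithm and includes no post selection inference. The Gaussian kernel, the bandwidth of 1.0, the scalar inputs, and all function names are assumptions made for illustration.

```python
import math


def gaussian_gram(values, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel over scalar samples (assumed kernel)."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in values]
            for a in values]


def center(gram):
    """Return H @ gram @ H for a symmetric Gram matrix, with H = I - 11^T / n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)]
            for i in range(n)]


def hsic(xs, ys, bandwidth=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, bandwidth))
    l_gram = gaussian_gram(ys, bandwidth)
    # trace(HKH @ L) equals the elementwise sum because both matrices are symmetric.
    total = sum(k_centered[i][j] * l_gram[i][j]
                for i in range(n) for j in range(n))
    return total / n ** 2


xs = [0.0, 1.0, 2.0, 3.0]
print(hsic(xs, xs))          # strictly positive: xs depends on itself
print(hsic(xs, [5.0] * 4))   # numerically zero: a constant carries no information
```

Larger values suggest stronger dependence between the two samples; deciding whether a value is statistically significant after feature selection is the part the paper addresses.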
Using IBM Machine Learning to Help Solve Real World Business Problems
Billions of connected devices, zettabytes of data, power and brand loyalty now in the hands of the consumer, businesses having to market and sell to each and every one of us. How can any business make sense of it all? How can they learn, avoid making the same mistakes, and become smarter? Oh, and did I mention much of this needs to happen in real time? That's where Machine Learning as part of a cognitive strategy comes into its own.
Machines assess risk and detect fraud - Raconteur
A formal branch of artificial intelligence, machine-learning builds systems that learn directly from the data they are fed and effectively program themselves to analyse that data and make accurate predictions. Having already helped multiple business sectors create new models and drive competitive advantage, now it's the turn of the insurance industry. So just how is machine-learning changing the way insurers do business? "It gives insurers three distinct advantages," explains Max Richter, managing director in Accenture's UK insurance analytics group. "The first is to mine greater volumes of data, the second to scale analytics across the organisation by working smarter and faster, and lastly by answering more complex questions from 'will this customer leave me at renewal?' to 'what can I do about it?'" As such it is quickly becoming an essential tool for the insurance sector, specifically enabling companies to yield higher predictive accuracy as it can fit more flexible and complex models.
Novel biomarkers increase power to predict therapeutic response in lupus
Results of preclinical studies by investigators at the Medical University of South Carolina (MUSC) reported in the August 2016 issue of Arthritis & Rheumatology demonstrate for the first time that including novel biomarkers in lupus nephritis (LN) prognostic models significantly increases their power to predict therapeutic efficacy. Identifying biomarker models with sufficient predictive power is a critical step toward developing clinical decision-making tools that can rapidly identify patients who require a change in therapy and potentially reduce onset of renal fibrosis during induction therapy. Approximately half of all patients with systemic lupus erythematosus (SLE) develop LN, an immune complex-mediated glomerulonephritis. Lupus nephritis, in turn, leads to renal failure in up to 50% of patients within five years. American College of Rheumatology guidelines recommend changing LN treatment after six months of induction therapy if response to therapy is not achieved.
Estimating mutual information in high dimensions via classification error
Zheng, Charles Y., Benjamini, Yuval
Multivariate pattern analysis approaches in neuroimaging are fundamentally concerned with investigating the quantity and type of information processed by various regions of the human brain; typically, estimates of classification accuracy are used to quantify information. While an extensive and powerful library of methods can be applied to train and assess classifiers, it is not always clear how to use the resulting measures of classification performance to draw scientific conclusions: e.g. for the purpose of evaluating redundancy between brain regions. An additional confound for interpreting classification performance is the dependence of the error rate on the number and choice of distinct classes obtained for the classification task. In contrast, mutual information is a quantity defined independently of the experimental design, and has ideal properties for comparative analyses. Unfortunately, estimating the mutual information based on observations becomes statistically infeasible in high dimensions without some kind of assumption or prior. In this paper, we construct a novel classification-based estimator of mutual information based on high-dimensional asymptotics. We show that in a particular limiting regime, the mutual information is an invertible function of the expected $k$-class Bayes error. While the theory is based on a large-sample, high-dimensional limit, we demonstrate through simulations that our proposed estimator has superior performance to the alternatives in problems of moderate dimensionality.
Confusion matrix - Wikipedia, the free encyclopedia
In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix,[4] is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice-versa).[2] The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another). It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table). If a classification system has been trained to distinguish between cats, dogs and rabbits, a confusion matrix will summarize the results of testing the algorithm for further inspection.
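The cats, dogs and rabbits example can be sketched in a few lines. The labels and predictions below are invented for illustration, and the layout follows the convention described above (rows are actual classes, columns are predicted classes).

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows index the actual class, columns the predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


labels = ["cat", "dog", "rabbit"]
# Hypothetical test results, made up for this example.
actual = ["cat", "cat", "cat", "dog", "dog", "rabbit", "rabbit", "rabbit"]
predicted = ["cat", "dog", "cat", "dog", "dog", "rabbit", "cat", "rabbit"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>6}: {row}")

# Accuracy is the diagonal (correct predictions) divided by the total.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)
print(f"accuracy = {accuracy:.2f}")
```

The off-diagonal cells show which classes are confused with which: in this made-up data one cat was labelled a dog and one rabbit was labelled a cat, detail that the single accuracy figure hides.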
Model evaluation, model selection, and algorithm selection in machine learning
Almost every machine learning algorithm comes with a large number of settings that we, the machine learning researchers and practitioners, need to specify. These tuning knobs, the so-called hyperparameters, help us control the behavior of machine learning algorithms when optimizing for performance, finding the right balance between bias and variance. Hyperparameter tuning for performance optimization is an art in itself, and there are no hard-and-fast rules that guarantee best performance on a given dataset. In Part I and Part II, we saw different holdout and bootstrap techniques for estimating the generalization performance of a model. We learned about the bias-variance trade-off, and we computed the uncertainty of our estimates. In this third part, we will focus on different methods of cross-validation for model evaluation and model selection. We will use these cross-validation techniques to rank models from several hyperparameter configurations and estimate how well they generalize to independent datasets. Previously, we used the holdout method or different flavors of bootstrapping to estimate the generalization performance of our predictive models.
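As a rough illustration of the cross-validation idea mentioned above, here is a minimal k-fold sketch in plain Python. It is not the article's code: the helper names, the toy data, and the trivial "predict the training mean" model are all assumptions for illustration, and the folds are taken in order without shuffling or stratification.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(xs, ys, fit, predict, k):
    """Return the mean squared error measured on each held-out fold."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test]
        scores.append(mean(errors))
    return scores


# Toy model (hypothetical): always predict the mean of the training targets.
def fit_mean(train_xs, train_ys):
    return mean(train_ys)


def predict_mean(model, x):
    return model


xs = list(range(6))
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = cross_validate(xs, ys, fit_mean, predict_mean, k=3)
print(scores, mean(scores))
```

Each sample is held out exactly once, so the averaged score uses all of the data for testing while never scoring a model on samples it was fitted to. Ranking hyperparameter configurations then amounts to repeating this loop per configuration and comparing the averaged scores.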
WWE No Mercy 2016: Match Card, Predictions For SmackDown PPV
The year's second pay-per-view featuring only wrestlers from "SmackDown" is scheduled for Sunday night in Sacramento. WWE No Mercy 2016 will include four championship matches, and a few belts seem likely to change hands. AJ Styles will defend the show's top title on a PPV for the first time. Styles won the championship from Ambrose with a low blow at WWE BackLash in September, and Cena has not held the belt since he lost it to Brock Lesnar at SummerSlam 2014. The Miz is the longest-reigning singles champion in WWE, but he might not have the belt for much longer.