 Performance Analysis


New IBM security tool uses machine learning to help businesses detect phishing - TechRepublic

#artificialintelligence

A new machine-learning-based security solution from IBM could help businesses detect phishing sites up to 250% faster than other methods. Announced via a blog post on Monday, the cognitive phishing detection feature is part of the IBM Security Trusteer platform. When it comes to hacking, phishing is one of the oldest tricks in the book. It has stayed around for so long, in part, because it still works. According to IBM Security research cited in the post, some 30% of phishing emails are opened by targeted recipients. Phishing works well because it capitalizes on the fact that humans are typically the weakest link in an organization's cybersecurity.


Gradient-based Regularization Parameter Selection for Problems with Non-smooth Penalty Functions

arXiv.org Machine Learning

In high-dimensional and/or non-parametric regression problems, regularization (or penalization) is used to control model complexity and induce desired structure. Each penalty has a weight parameter that indicates how strongly the structure corresponding to that penalty should be enforced. Typically the parameters are chosen to minimize the error on a separate validation set using a simple grid search or a gradient-free optimization method. It is more efficient to tune parameters if the gradient can be determined, but this is often difficult for problems with non-smooth penalty functions. Here we show that for many penalized regression problems, the validation loss is actually smooth almost-everywhere with respect to the penalty parameters. We can therefore apply a modified gradient descent algorithm to tune parameters. Through simulation studies on example regression problems, we find that increasing the number of penalty parameters and tuning them using our method can decrease the generalization error.
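The idea of tuning a penalty weight by following the gradient of the validation loss can be sketched on a deliberately simpler case. The example below uses ridge regression, whose penalty is smooth, as a stand-in; it is not the paper's algorithm for non-smooth penalties. The function names, the log-parameterization of the penalty weight, and the step-halving rule are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of 0.5*||y - X b||^2 + 0.5*lam*||b||^2."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y), A

def val_loss_and_grad(lam, X_tr, y_tr, X_val, y_val):
    """Validation loss and its derivative with respect to lam."""
    beta, A = ridge_fit(X_tr, y_tr, lam)
    r = y_val - X_val @ beta
    loss = 0.5 * r @ r
    # Implicit differentiation of A(lam) beta = X^T y gives
    # d beta / d lam = -A^{-1} beta.
    dbeta = -np.linalg.solve(A, beta)
    grad = -(X_val.T @ r) @ dbeta
    return loss, grad

def tune_lambda(X_tr, y_tr, X_val, y_val, lam0=1.0, step=1.0, iters=50):
    """Gradient descent on theta = log(lam) with step halving (illustrative)."""
    theta = np.log(lam0)
    loss, g = val_loss_and_grad(np.exp(theta), X_tr, y_tr, X_val, y_val)
    for _ in range(iters):
        d = np.exp(theta) * g  # chain rule: dL/dtheta = lam * dL/dlam
        t = step
        improved = False
        while t > 1e-8:
            # Clip theta so exp() stays in a safe numeric range.
            new_theta = float(np.clip(theta - t * d, -20.0, 20.0))
            new_loss, new_g = val_loss_and_grad(
                np.exp(new_theta), X_tr, y_tr, X_val, y_val)
            if new_loss <= loss:
                improved = True
                break
            t *= 0.5
        if not improved:
            break
        theta, loss, g = new_theta, new_loss, new_g
    return np.exp(theta), loss
```

Working in log-space keeps the penalty weight positive without an explicit constraint. For a non-smooth penalty such as the lasso, the derivative of the fitted coefficients would have to be computed differently, which is the case the abstract addresses.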


Deep scattering transform applied to note onset detection and instrument recognition

arXiv.org Machine Learning

Automatic Music Transcription (AMT) is one of the oldest and most well-studied problems in the field of music information retrieval. Within this challenging research field, onset detection and instrument recognition take important places in transcription systems, as they respectively help to determine exact onset times of notes and to recognize the corresponding instrument sources. The aim of this study is to explore the usefulness of multiscale scattering operators for these two tasks on plucked string instrument and piano music. After summarizing the theoretical background and illustrating the key features of this sound representation method, we evaluate its performance against other classical sound representations. Using both MIDI-driven datasets with real instrument samples and real musical pieces, scattering is shown to outperform other sound representations for these AMT subtasks, highlighting its richer sound representation and invariance properties.


Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes

arXiv.org Artificial Intelligence

Datasets with hundreds of variables and many missing values are commonplace. In this setting, it is both statistically and computationally challenging to detect true predictive relationships between variables and also to suppress false positives. This paper proposes an approach that combines probabilistic programming, information theory, and non-parametric Bayes. It shows how to use Bayesian non-parametric modeling to (i) build an ensemble of joint probability models for all the variables; (ii) efficiently detect marginal independencies; and (iii) estimate the conditional mutual information between arbitrary subsets of variables, subject to a broad class of constraints. Users can access these capabilities using BayesDB, a probabilistic programming platform for probabilistic data analysis, by writing queries in a simple, SQL-like language. This paper demonstrates empirically that the method can (i) detect context-specific (in)dependencies on challenging synthetic problems and (ii) yield improved sensitivity and specificity over baselines from statistics and machine learning, on a real-world database of over 300 sparsely observed indicators of macroeconomic development and public health.


Additive Models with Trend Filtering

arXiv.org Machine Learning

We consider additive models built with trend filtering, i.e., additive models whose components are each regularized by the (discrete) total variation of their $(k+1)$st (discrete) derivative, for a chosen integer $k \geq 0$. This results in $k$th degree piecewise polynomial components (e.g., $k=0$ gives piecewise constant components, $k=1$ gives piecewise linear, $k=2$ gives piecewise quadratic, and so on). In univariate nonparametric regression, the localized nature of the total variation regularizer used by trend filtering has been shown to produce estimates with superior local adaptivity to those from smoothing splines (and linear smoothers, more generally) (Tibshirani [2014]). Further, the structured nature of this regularizer has been shown to lead to highly efficient computational routines for trend filtering (Kim et al. [2009], Ramdas and Tibshirani [2016]). In this paper, we argue that both of these properties carry over to the additive models setting. We derive fast error rates for additive trend filtering estimates, and prove that these rates are minimax optimal when the underlying function is itself additive and has component functions whose derivatives are of bounded variation. We show that such rates are unattainable by additive smoothing splines (and by additive models built from linear smoothers, in general). We argue that backfitting provides an efficient algorithm for additive trend filtering, as it is built around the fast univariate trend filtering solvers; moreover, we describe a modified backfitting procedure whose iterations can be run in parallel. Finally, we conduct experiments to examine the empirical properties of additive trend filtering, and outline some possible extensions.
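The penalty described above can be made concrete for a single component. The sketch below assumes evenly spaced inputs and omits any scaling constants; the helper names `difference_operator` and `trend_filter_penalty` are hypothetical and do not come from the paper. It only evaluates the penalty and does not solve the trend filtering problem.

```python
import numpy as np

def difference_operator(n, order):
    """Discrete difference matrix of the given order, shape (n - order, n)."""
    # Differencing the rows of the identity yields the operator itself.
    return np.diff(np.eye(n), n=order, axis=0)

def trend_filter_penalty(theta, k):
    """L1 norm of the (k+1)st discrete difference of theta."""
    theta = np.asarray(theta, dtype=float)
    D = difference_operator(len(theta), k + 1)
    return np.abs(D @ theta).sum()
```

With `k = 0` the penalty is the ordinary total variation of the fitted values, so it counts the sizes of jumps; with `k = 1` it vanishes on any straight line and charges only for changes in slope, which is why the fitted components come out piecewise linear.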


How Big Data is Redefining the Banking and Financial Industry

#artificialintelligence

In the face of unmetered innovation across multiple industries, the banking industry has been rather quiet. For centuries, the banking industry has gone unscathed by the unrelenting tides of change. People still queue in banks to perform the simplest transactions. Appending the wrong signature on a check or form can lock you out of your bank account - or at the very least, turn your day into a nightmare. Thankfully, there are positive signs that the industry is slowly undergoing a transformation.


"Influence Sketching": Finding Influential Samples In Large-Scale Regressions

arXiv.org Machine Learning

There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware.
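For reference, the classical quantity that the abstract builds on can be sketched as follows. This is plain Cook's distance for ordinary least squares, not the influence sketching algorithm: it includes no random projection and no GLM, and the function name is an assumption made for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of every sample in an ordinary least squares fit.

    X is an (n, p) design matrix with full column rank (include a column
    of ones if an intercept is wanted); y has length n.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages are the diagonal of the hat matrix X (X^T X)^{-1} X^T,
    # obtained here from the thin QR factor as row-wise squared norms.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid ** 2 / (p * s2) * h / (1.0 - h) ** 2
```

Each score equals the scaled shift in the fitted coefficients caused by deleting that one sample, computed without refitting. Forming the leverages exactly is what becomes expensive with millions of samples and roughly 100,000 features, which is the setting the abstract targets.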


WrestleMania 33 Card Up To 12 Matches With Latest Addition To WWE 2017 PPV

International Business Times

Less than two weeks away from WrestleMania 33, the number of matches officially on the card is up to 12. The latest added to the biggest WWE pay-per-view of 2017 is the Intercontinental Championship Match between Dean Ambrose and Baron Corbin. The match joined the list Tuesday night when Ambrose accepted Corbin's challenge on "SmackDown Live." Ambrose distracted Corbin during the Lone Wolf's match with Randy Orton, causing him to get hit with an RKO and suffer the loss. Ambrose ran down to the ring and delivered a Dirty Deeds for good measure.


Random Forests for Big Data

arXiv.org Machine Learning

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single and versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one of real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.


Perspective: Energy Landscapes for Machine Learning

arXiv.org Machine Learning

Machine learning techniques are being increasingly used as flexible non-linear fitting and prediction tools in the physical sciences. Fitting functions that exhibit multiple solutions as local minima can be analysed in terms of the corresponding machine learning landscape. Methods to explore and visualise molecular potential energy landscapes can be applied to these machine learning landscapes to gain new insight into the solution space involved in training and the nature of the corresponding predictions. In particular, we can define quantities analogous to molecular structure, thermodynamics, and kinetics, and relate these emergent properties to the structure of the underlying landscape. This Perspective aims to describe these analogies with examples from recent applications, and suggest avenues for new interdisciplinary research.