A Geek's Guide to Machine Learning and Risk analytics and Decisioning Provenir


The greatest challenge when talking about artificial intelligence/machine learning is actually in understanding what data sets we are looking at, and what model/combination of models to apply. Amazon's Machine Learning offering is one example of an automated process which analyses the data and automatically selects the best model to use in the scenario. Other big players who have similar offerings are IBM Watson, Google and Microsoft. Provenir's clients are continually looking at new and innovative ways to improve their risk decisioning. Traditional banks offering consumer, SME and commercial loans and credit, auto lenders, payment providers and fintech companies are using Provenir technology to help them make faster and better decisions about potential fraud. Integrating artificial intelligence/machine learning capabilities into the risk decisioning process can increase the organization's ability to accurately assess the level of risk in order to detect and prevent fraud. Provenir provides model integration adaptors for machine learning models, including Amazon Machine Learning (AML) that can automatically listen for and label business-defined events, calculate attributes and update machine learning models. By combining Provenir technology with machine learning, organizations can increase both the efficiency and predictive accuracy of their risk decisioning.

Robust Synthetic Control

arXiv.org Machine Learning

We present a robust generalization of the synthetic control method for comparative case studies. Like the classical method, we present an algorithm to estimate the unobservable counterfactual of a treatment unit. A distinguishing feature of our algorithm is that of de-noising the data matrix via singular value thresholding, which renders our approach robust in multiple facets: it automatically identifies a good subset of donors, overcomes the challenges of missing data, and continues to work well in settings where covariate information may not be provided. To begin, we establish the condition under which the fundamental assumption in synthetic control-like approaches holds, i.e. when the linear relationship between the treatment unit and the donor pool prevails in both the pre- and post-intervention periods. We provide the first finite sample analysis for a broader class of models, the Latent Variable Model, in contrast to Factor Models previously considered in the literature. Further, we show that our de-noising procedure accurately imputes missing entries, producing a consistent estimator of the underlying signal matrix provided $p = \Omega( T^{-1 + \zeta})$ for some $\zeta > 0$; here, $p$ is the fraction of observed data and $T$ is the time interval of interest. Under the same setting, we prove that the mean-squared-error (MSE) in our prediction estimation scales as $O(\sigma^2/p + 1/\sqrt{T})$, where $\sigma^2$ is the noise variance. Using a data aggregation method, we show that the MSE can be made as small as $O(T^{-1/2+\gamma})$ for any $\gamma \in (0, 1/2)$, leading to a consistent estimator. We also introduce a Bayesian framework to quantify the model uncertainty through posterior probabilities. Our experiments, using both real-world and synthetic datasets, demonstrate that our robust generalization yields an improvement over the classical synthetic control method.

Network Detection Theory and Performance

arXiv.org Machine Learning

Network detection is an important capability in many areas of applied research in which data can be represented as a graph of entities and relationships. Oftentimes the object of interest is a relatively small subgraph in an enormous, potentially uninteresting background. This aspect characterizes network detection as a "big data" problem. Graph partitioning and network discovery have been major research areas over the last ten years, driven by interest in internet search, cyber security, social networks, and criminal or terrorist activities. The specific problem of network discovery is addressed as a special case of graph partitioning in which membership in a small subgraph of interest must be determined. Algebraic graph theory is used as the basis to analyze and compare different network detection methods. A new Bayesian network detection framework is introduced that partitions the graph based on prior information and direct observations. The new approach, called space-time threat propagation, is proved to maximize the probability of detection and is therefore optimum in the Neyman-Pearson sense. This optimality criterion is compared to spectral community detection approaches which divide the global graph into subsets or communities with optimal connectivity properties. We also explore a new generative stochastic model for covert networks and analyze using receiver operating characteristics the detection performance of both classes of optimal detection techniques.

BIRDNEST: Bayesian Inference for Ratings-Fraud Detection

arXiv.org Artificial Intelligence

Review fraud is a pervasive problem in online commerce, in which fraudulent sellers write or purchase fake reviews to manipulate perception of their products and services. Fake reviews are often detected based on several signs, including 1) they occur in short bursts of time; 2) fraudulent user accounts have skewed rating distributions. However, these may both be true in any given dataset. Hence, in this paper, we propose an approach for detecting fraudulent reviews which combines these 2 approaches in a principled manner, allowing successful detection even when one of these signs is not present. To combine these 2 approaches, we formulate our Bayesian Inference for Rating Data (BIRD) model, a flexible Bayesian model of user rating behavior. Based on our model we formulate a likelihood-based suspiciousness metric, Normalized Expected Surprise Total (NEST). We propose a linear-time algorithm for performing Bayesian inference using our model and computing the metric. Experiments on real data show that BIRDNEST successfully spots review fraud in large, real-world graphs: the 50 most suspicious users of the Flipkart platform flagged by our algorithm were investigated and all identified as fraudulent by domain experts at Flipkart.

Technosocial Modeling of IED Threat Scenarios and Attack

AAAI Conferences

This paper describes an approach for integrating sociological and technical models to develop more complete threat assessment. Current approaches to analyzing and addressing threats tend to focus on the technical factors. This paper addresses development of predictive models that encompass behavioral as well as these technical factors. Using improvised explosive device (IED) attacks as motivation, this model supports identification of intervention activities'left of boom' as well as prioritizing attack modalities. We show how Bayes nets integrate social factors associated with IED attacks into general threat model containing technical and organizational steps from planning through obtaining the IED to initiation of the attack. The models are computational representations of process models and associated social factors. When combined with technical models, the resulting model provides improved knowledge integration into threat assessment for monitoring.