Accuracy
Machine Leaning Getting More Like Human Learning
Machine learning surveillance techniques used to spot unfair or collusive activity at firms have advanced far beyond searching lists for risky words, and now make myriad connections based on a wide range of behavior and interactions. Tim Estes, CEO of Digital Reasoning, which provides advanced compliance tools of this sort, said to trick the surveillance software would mean avoiding every system -- email, chat rooms etc. -- and not doing anything behaviorally that would create a signal either. "If you think about it, it's very hard to be someone that you are not all the time consistently. We have a word for those people -- they are called sociopaths," Estes said. Digital Reasoning is a cognitive computing company focused on applications that leverage human communication data.
Provable Self-Representation Based Outlier Detection in a Union of Subspaces
You, Chong, Robinson, Daniel P., Vidal, Renรฉ
Many computer vision tasks involve processing large amounts of data contaminated by outliers, which need to be detected and rejected. While outlier detection methods based on robust statistics have existed for decades, only recently have methods based on sparse and low-rank representation been developed along with guarantees of correct outlier detection when the inliers lie in one or more low-dimensional subspaces. This paper proposes a new outlier detection method that combines tools from sparse representation with random walks on a graph. By exploiting the property that data points can be expressed as sparse linear combinations of each other, we obtain an asymmetric affinity matrix among data points, which we use to construct a weighted directed graph. By defining a suitable Markov Chain from this graph, we establish a connection between inliers/outliers and essential/inessential states of the Markov chain, which allows us to detect outliers by using random walks. We provide a theoretical analysis that justifies the correctness of our method under geometric and connectivity assumptions. Experimental results on image databases demonstrate its superiority with respect to state-of-the-art sparse and low-rank outlier detection methods.
Approximate Kernel-based Conditional Independence Tests for Fast Non-Parametric Causal Discovery
Strobl, Eric V., Zhang, Kun, Visweswaran, Shyam
Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales cubicly with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which both approximate KCIT by utilizing random Fourier features. In practice, both of the proposed tests scale linearly with sample size and return accurate p-values much faster than KCIT in the large sample size context. CCD algorithms run with RCIT or RCoT also return graphs at least as accurate as the same algorithms run with KCIT but with large reductions in run time.
Optimized Data Pre-Processing for Discrimination Prevention
Calmon, Flavio P., Wei, Dennis, Ramamurthy, Karthikeyan Natesan, Varshney, Kush R.
Non-discrimination is a recognized objective in algorithmic decision making. In this paper, we introduce a novel probabilistic formulation of data pre-processing for reducing discrimination. We propose a convex optimization for learning a data transformation with three goals: controlling discrimination, limiting distortion in individual data samples, and preserving utility. We characterize the impact of limited sample size in accomplishing this objective, and apply two instances of the proposed optimization to datasets, including one on real-world criminal recidivism. The results demonstrate that all three criteria can be simultaneously achieved and also reveal interesting patterns of bias in American society.
Mixed Graphical Models for Causal Analysis of Multi-modal Variables
Sedgewick, Andrew J, Ramsey, Joseph D., Spirtes, Peter, Glymour, Clark, Benos, Panayiotis V.
Graphical causal models are an important tool for knowledge discovery because they can represent both the causal relations between variables and the multivariate probability distributions over the data. Once learned, causal graphs can be used for classification, feature selection and hypothesis generation, while revealing the underlying causal network structure and thus allowing for arbitrary likelihood queries over the data. However, current algorithms for learning sparse directed graphs are generally designed to handle only one type of data (continuous-only or discrete-only), which limits their applicability to a large class of multi-modal biological datasets that include mixed type variables. To address this issue, we developed new methods that modify and combine existing methods for finding undirected graphs with methods for finding directed graphs. These hybrid methods are not only faster, but also perform better than the directed graph estimation methods alone for a variety of parameter settings and data set sizes. Here, we describe a new conditional independence test for learning directed graphs over mixed data types and we compare performances of different graph learning strategies on synthetic data.
Interactive Graphics for Visually Diagnosing Forest Classifiers in R
da Silva, Natalia, Cook, Dianne, Lee, Eun-Kyung
This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble, produced by bagging multiple trees. The process of bagging and combining results from multiple trees, produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm, and to the projection pursuit forest, but could be more broadly applied to other bagged ensembles. Interactive graphics are built in R, using the ggplot2, plotly, and shiny packages.
Optimal change point detection in Gaussian processes
Keshavarz, Hossein, Scott, Clayton, Nguyen, XuanLong
We study the problem of detecting a change in the mean of one-dimensional Gaussian process data. This problem is investigated in the setting of increasing domain (customarily employed in time series analysis) and in the setting of fixed domain (typically arising in spatial data analysis). We propose a detection method based on the generalized likelihood ratio test (GLRT), and show that our method achieves nearly asymptotically optimal rate in the minimax sense, in both settings. The salient feature of the proposed method is that it exploits in an efficient way the data dependence captured by the Gaussian process covariance structure. When the covariance is not known, we propose the plug-in GLRT method and derive conditions under which the method remains asymptotically near optimal. By contrast, the standard CUSUM method, which does not account for the covariance structure, is shown to be asymptotically optimal only in the increasing domain. Our algorithms and accompanying theory are applicable to a wide variety of covariance structures, including the Matern class, the powered exponential class, and others. The plug-in GLRT method is shown to perform well for maximum likelihood estimators with a dense covariance matrix.
An Efficient Pseudo-likelihood Method for Sparse Binary Pairwise Markov Network Estimation
Geng, Sinong, Kuang, Zhaobin, Page, David
The pseudo-likelihood method is one of the most popular algorithms for learning sparse binary pairwise Markov networks. In this paper, we formulate the $L_1$ regularized pseudo-likelihood problem as a sparse multiple logistic regression problem. In this way, many insights and optimization procedures for sparse logistic regression can be applied to the learning of discrete Markov networks. Specifically, we use the coordinate descent algorithm for generalized linear models with convex penalties, combined with strong screening rules, to solve the pseudo-likelihood problem with $L_1$ regularization. Therefore a substantial speedup without losing any accuracy can be achieved. Furthermore, this method is more stable than the node-wise logistic regression approach on unbalanced high-dimensional data when penalized by small regularization parameters. Thorough numerical experiments on simulated data and real world data demonstrate the advantages of the proposed method.
Causality on Longitudinal Data: Stable Specification Search in Constrained Structural Equation Modeling
Rahmadi, Ridho, Groot, Perry, van Rijn, Marieke HC, Brand, Jan AJG van den, Heins, Marianne, Knoop, Hans, Heskes, Tom
A typical problem in causal modeling is the instability of model structure learning, i.e., small changes in finite data can result in completely different optimal models. The present work introduces a novel causal modeling algorithm for longitudinal data, that is robust for finite samples based on recent advances in stability selection using subsampling and selection algorithms. Our approach uses exploratory search but allows incorporation of prior knowledge, e.g., the absence of a particular causal relationship between two specific variables. We represent causal relationships using structural equation models. Models are scored along two objectives: the model fit and the model complexity. Since both objectives are often conflicting we apply a multi-objective evolutionary algorithm to search for Pareto optimal models. To handle the instability of small finite data samples, we repeatedly subsample the data and select those substructures (from the optimal models) that are both stable and parsimonious. These substructures can be visualized through a causal graph. Our more exploratory approach achieves at least comparable performance as, but often a significant improvement over state-of-the-art alternative approaches on a simulated data set with a known ground truth. We also present the results of our method on three real-world longitudinal data sets on chronic fatigue syndrome, Alzheimer disease, and chronic kidney disease. The findings obtained with our approach are generally in line with results from more hypothesis-driven analyses in earlier studies and suggest some novel relationships that deserve further research.
Carbon Black warns that artificial intelligence is not a silver bullet
The research, which Carbon Black says looked "Beyond the Hype" found that the roles of AI and ML in preventing cyber-attacks have been met with both hope and skepticism. The vast majority (93 percent) of the 400 security researchers interviewed while conducting this research said non-malware attacks pose more of a business risk than commodity malware attacks, and more importantly that these are often not stopped by traditional anti-virus offerings. Mike Viscuso, co-founder and CTO of Carbon Black told SC Media UK: "Researchers have reported seeing an increase in the number, and sophistication, of non-malware attacks. These attacks are specifically designed to evade file-based prevention mechanisms and leverage native operating system tools to keep attackers under the radar." One respondent explained: "Most users seem to be familiar with the idea that their computer or network may have accidentally become infected with a virus, but rarely consider a person who is actually attacking them in a more proactive and targeted manner."