AITopics | Performance Analysis

Collaborating Authors

Performance Analysis

News Overviews Instructional Materials AI-Alerts Classics

GraphPrints: Towards a Graph Analytic Method for Network Anomaly Detection

Harshaw, Christopher R., Bridges, Robert A., Iannacone, Michael D., Reed, Joel W., Goodall, John R.

arXiv.org Machine LearningFeb-2-2016

This paper introduces a novel graph-analytic approach for detecting anomalies in network flow data called GraphPrints. Building on foundational network-mining techniques, our method represents time slices of traffic as a graph, then counts graphlets -- small induced subgraphs that describe local topology. By performing outlier detection on the sequence of graphlet counts, anomalous intervals of traffic are identified, and furthermore, individual IPs experiencing abnormal behavior are singled-out. Initial testing of GraphPrints is performed on real network data with an implanted anomaly. Evaluation shows false positive rates bounded by 2.84% at the time-interval level, and 0.05% at the IP-level with 100% true positive rates at both.

data mining, machine learning, traffic, (18 more...)

arXiv.org Machine Learning

doi: 10.1145/1235

1602.0113

Country: North America > United States (1.00)

Genre: Research Report (0.83)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.94)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

Early Predictions of Movie Success: the Who, What, and When of Profitability

Lash, Michael T., Zhao, Kang

arXiv.org Artificial IntelligenceJan-29-2016

This paper proposes a decision support system to aid movie investment decisions at the early stage of movie productions. The system predicts the success of a movie based on its profitability by leveraging historical data from various sources. Using social network analysis and text mining techniques, the system automatically extracts several groups of features, including "who" are on the cast, "what" a movie is about, "when" a movie will be released, as well as "hybrid" features that match "who" with "what", and "when" with "what". Experiment results with movies during an 11-year period showed that the system outperforms benchmark methods by a large margin in predicting movie profitability. Novel features we proposed also made great contributions to the prediction. In addition to designing a decision support system with practical utilities, our analysis of key factors for movie profitability may also have implications for theoretical research on team performance and the success of creative work.

data mining, decision support system, machine learning, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1080/07421222.2016.1243969

1506.05382

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Decision Support Systems (1.00)
Information Technology > Data Science > Data Mining (1.00)
(4 more...)

Add feedback

Sparse Generalized Principal Component Analysis for Large-scale Applications beyond Gaussianity

Zhang, Qiaoya, She, Yiyuan

arXiv.org Machine LearningJan-27-2016

Principal Component Analysis (PCA) is a dimension reduction technique. It produces inconsistent estimators when the dimensionality is moderate to high, which is often the problem in modern large-scale applications where algorithm scalability and model interpretability are difficult to achieve, not to mention the prevalence of missing values. While existing sparse PCA methods alleviate inconsistency, they are constrained to the Gaussian assumption of classical PCA and fail to address algorithm scalability issues. We generalize sparse PCA to the broad exponential family distributions under high-dimensional setup, with built-in treatment for missing values. Meanwhile we propose a family of iterative sparse generalized PCA (SG-PCA) algorithms such that despite the non-convexity and non-smoothness of the optimization task, the loss function decreases in every iteration. In terms of ease and intuitive parameter tuning, our sparsity-inducing regularization is far superior to the popular Lasso. Furthermore, to promote overall scalability, accelerated gradient is integrated for fast convergence, while a progressive screening technique gradually squeezes out nuisance dimensions of a large-scale problem for feasible optimization. High-dimensional simulation and real data experiments demonstrate the efficiency and efficacy of SG-PCA.

algorithm, dimension, sg-pca, (15 more...)

arXiv.org Machine Learning

1512.03883

Country:

North America > United States > Florida > Leon County > Tallahassee (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Principal Component Analysis (0.61)

Add feedback

Font Identification in Historical Documents Using Active Learning

Gupta, Anshul, Gutierrez-Osuna, Ricardo, Christy, Matthew, Furuta, Richard, Mandell, Laura

arXiv.org Machine LearningJan-26-2016

Identifying the type of font (e.g., Roman, Blackletter) used in historical documents can help optical character recognition (OCR) systems produce more accurate text transcriptions. Towards this end, we present an active-learning strategy that can significantly reduce the number of labeled samples needed to train a font classifier. Our approach extracts image-based features that exploit geometric differences between fonts at the word level, and combines them into a bag-of-word representation for each page in a document. We evaluate six sampling strategies based on uncertainty, dissimilarity and diversity criteria, and test them on a database containing over 3,000 historical documents with Blackletter, Roman and Mixed fonts. Our results show that a combination of uncertainty and diversity achieves the highest predictive accuracy (89% of test cases correctly classified) while requiring only a small fraction of the data (17%) to be labeled. We discuss the implications of this result for mass digitization projects of historical documents.

artificial intelligence, machine learning, pattern recognition, (16 more...)

arXiv.org Machine Learning

1601.07252

Country: North America > United States > Texas (0.15)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback

Supersparse Linear Integer Models for Optimized Medical Scoring Systems

Ustun, Berk, Rudin, Cynthia

arXiv.org Machine LearningJan-26-2016

Scoring systems are linear classification models that only require users to add, subtract and multiply a few small numbers in order to make a prediction. These models are in widespread use by the medical community, but are difficult to learn from data because they need to be accurate and sparse, have coprime integer coefficients, and satisfy multiple operational constraints. We present a new method for creating data-driven scoring systems called a Supersparse Linear Integer Model (SLIM). SLIM scoring systems are built by solving an integer program that directly encodes measures of accuracy (the 0-1 loss) and sparsity (the $\ell_0$-seminorm) while restricting coefficients to coprime integers. SLIM can seamlessly incorporate a wide range of operational constraints related to accuracy and sparsity, and can produce highly tailored models without parameter tuning. We provide bounds on the testing and training accuracy of SLIM scoring systems, and present a new data reduction technique that can improve scalability by eliminating a portion of the training data beforehand. Our paper includes results from a collaboration with the Massachusetts General Hospital Sleep Laboratory, where SLIM was used to create a highly tailored scoring system for sleep apnea screening

artificial intelligence, constraint, machine learning, (16 more...)

arXiv.org Machine Learning

doi: 10.1007/s10994-015-5528-6

1502.04269

Country: North America > United States > Massachusetts (0.34)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Providers & Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

Conditional distribution variability measures for causality detection

Fonollosa, José A. R.

arXiv.org Machine LearningJan-25-2016

In this paper we derive variability measures for the conditional probability distributions of a pair of random variables, and we study its application in the inference of causal-effect relationships. We also study the combination of the proposed measures with standard statistical measures in the the framework of the ChaLearn cause-effect pair challenge. The developed model obtains an AUC score of 0.82 on the final test database and ranked second in the challenge.

artificial intelligence, conditional distribution variability measure, machine learning, (13 more...)

arXiv.org Machine Learning

1601.0668

Country: North America > United States > New York (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Learning Minimum Volume Sets and Anomaly Detectors from KNN Graphs

Root, Jonathan, Saligrama, Venkatesh, Qian, Jing

arXiv.org Machine LearningJan-22-2016

We propose a non-parametric anomaly detection algorithm for high dimensional data. We first rank scores derived from nearest neighbor graphs on $n$-point nominal training data. We then train limited complexity models to imitate these scores based on the max-margin learning-to-rank framework. A test-point is declared as an anomaly at $\alpha$-false alarm level if the predicted score is in the $\alpha$-percentile. The resulting anomaly detector is shown to be asymptotically optimal in that for any false alarm rate $\alpha$, its decision region converges to the $\alpha$-percentile minimum volume level set of the unknown underlying density. In addition, we test both the statistical performance and computational efficiency of our algorithm on a number of synthetic and real-data experiments. Our results demonstrate the superiority of our algorithm over existing $K$-NN based anomaly detection algorithms, with significant computational savings.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1601.06105

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Add feedback

A Framework for Individualizing Predictions of Disease Trajectories by Exploiting Multi-Resolution Structure

Schulam, Peter, Saria, Suchi

arXiv.org Machine LearningJan-21-2016

For many complex diseases, there is a wide variety of ways in which an individual can manifest the disease. The challenge of personalized medicine is to develop tools that can accurately predict the trajectory of an individual's disease, which can in turn enable clinicians to optimize treatments. We represent an individual's disease trajectory as a continuous-valued continuous-time function describing the severity of the disease over time. We propose a hierarchical latent variable model that individualizes predictions of disease trajectories. This model shares statistical strength across observations at different resolutions--the population, subpopulation and the individual level. We describe an algorithm for learning population and subpopulation parameters offline, and an online procedure for dynamically learning individual-specific parameters. Finally, we validate our model on the task of predicting the course of interstitial lung disease, a leading cause of death among patients with the autoimmune disease scleroderma. We compare our approach against state-of-the-art and demonstrate significant improvements in predictive accuracy.

artificial intelligence, machine learning, prediction, (16 more...)

arXiv.org Machine Learning

1601.04674

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry: Health & Medicine > Therapeutic Area > Immunology (0.88)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

Sparse Recovery via Differential Inclusions

Osher, Stanley, Ruan, Feng, Xiong, Jiechao, Yao, Yuan, Yin, Wotao

arXiv.org Machine LearningJan-21-2016

In this paper, we recover sparse signals from their noisy linear measurements by solving nonlinear differential inclusions, which is based on the notion of inverse scale space (ISS) developed in applied mathematics. Our goal here is to bring this idea to address a challenging problem in statistics, \emph{i.e.} finding the oracle estimator which is unbiased and sign-consistent using dynamics. We call our dynamics \emph{Bregman ISS} and \emph{Linearized Bregman ISS}. A well-known shortcoming of LASSO and any convex regularization approaches lies in the bias of estimators. However, we show that under proper conditions, there exists a bias-free and sign-consistent point on the solution paths of such dynamics, which corresponds to a signal that is the unbiased estimate of the true signal and whose entries have the same signs as those of the true signs, \emph{i.e.} the oracle estimator. Therefore, their solution paths are regularization paths better than the LASSO regularization path, since the points on the latter path are biased when sign-consistency is reached. We also show how to efficiently compute their solution paths in both continuous and discretized settings: the full solution paths can be exactly computed piece by piece, and a discretization leads to \emph{Linearized Bregman iteration}, which is a simple iterative thresholding rule and easy to parallelize. Theoretical guarantees such as sign-consistency and minimax optimal $l_2$-error bounds are established in both continuous and discrete settings for specific points on the paths. Early-stopping rules for identifying these points are given. The key treatment relies on the development of differential inequalities for differential inclusions and their discretizations, which extends the previous results and leads to exponentially fast recovering of sparse signals before selecting wrong ones.

artificial intelligence, bregman iss, machine learning, (16 more...)

arXiv.org Machine Learning

doi: 10.1016/j.acha.2016.01.002

1406.7728

Country: North America > United States > California (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Sensing and Signal Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Finding structure in data using multivariate tree boosting

Miller, Patrick J., Lubke, Gitta H., McArtor, Daniel B., Bergeman, C. S.

arXiv.org Machine LearningJan-21-2016

Technology and collaboration enable dramatic increases in the size of psychological and psychiatric data collections, but finding structure in these large data sets with many collected variables is challenging. Decision tree ensembles like random forests (Strobl, Malley, and Tutz, 2009) are a useful tool for finding structure, but are difficult to interpret with multiple outcome variables which are often of interest in psychology. To find and interpret structure in data sets with multiple outcomes and many predictors (possibly exceeding the sample size), we introduce a multivariate extension to a decision tree ensemble method called Gradient Boosted Regression Trees (Friedman, 2001). Our method, multivariate tree boosting, can be used for identifying important predictors, detecting predictors with non-linear effects and interactions without specification of such effects, and for identifying predictors that cause two or more outcome variables to covary without parametric assumptions. We provide the R package 'mvtboost' to estimate, tune, and interpret the resulting model, which extends the implementation of univariate boosting in the R package 'gbm' (Ridgeway, 2013) to continuous, multivariate outcomes. To illustrate the approach, we analyze predictors of psychological well-being (Ryff and Keyes, 1995). Simulations verify that our approach identifies predictors with non-linear effects and achieves high prediction accuracy, exceeding or matching the performance of (penalized) multivariate multiple regression and multivariate decision trees over a wide range of conditions.

artificial intelligence, machine learning, predictor, (18 more...)

arXiv.org Machine Learning

doi: 10.1037/met0000087

1511.02025

Country: North America > United States > New York (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback