AITopics | Statistical Learning

Collaborating Authors

Statistical Learning

News Overviews Instructional Materials AI-Alerts Classics

A bagging SVM to learn from positive and unlabeled examples

arXiv.org Machine LearningOct-5-2010

In many applications, such as information retrieval or gene ranking, one is given a finite set of data of interest sharing a particular property, and wishes to find other data sharing the same property. In information retrieval, for example, the finite set can be a user query, or a set of documents known to belong to a specific category, and the goal is to scan a large database of documents to identify new documents related to the query or belonging to the same category. In gene ranking, the query is a finite list of genes known to have a given function or to be associated to a given disease, and the goal is to identify new genes sharing the same property (Aerts et al., 2006). In fact this setting is ubiquitous in many applications where identifying a data of interest is difficult or expensive, e.g., because human intervention is necessary or expensive experiments are needed, while unlabeled data can be easily collected. In such cases there is a clear opportunity to alleviate the burden and cost of interesting data identification with the help of machine learning techniques. More formally, let us assign a binary label to each possible data: positive ( 1) for data of interest, negative ( 1) for other data. Unlabeled data are data for which we do not know whether 1 they are interesting or not. Denoting X the set of data, we assume that the "query" is a finite set of data P {x

artificial intelligence, classifier, machine learning, (19 more...)

arXiv.org Machine Learning

1010.0772

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.47)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Trek separation for Gaussian graphical models

Sullivant, Seth, Talaska, Kelli, Draisma, Jan

arXiv.org Machine LearningOct-4-2010

Gaussian graphical models are semi-algebraic subsets of the cone of positive definite covariance matrices. Submatrices with low rank correspond to generalizations of conditional independence constraints on collections of random variables. We give a precise graph-theoretic characterization of when submatrices of the covariance matrix have small rank for a general class of mixed graphs that includes directed acyclic and undirected graphs as special cases. Our new trek separation criterion generalizes the familiar $d$-separation criterion. Proofs are based on the trek rule, the resulting matrix factorizations and classical theorems of algebraic combinatorics on the expansions of determinants of path polynomials.

artificial intelligence, graph, machine learning, (17 more...)

arXiv.org Machine Learning

doi: 10.1214/09-AOS760

0812.1938

Country: North America > United States > Michigan (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.85)

Add feedback

A Comprehensive Survey of Data Mining-based Fraud Detection Research

Phua, Clifton, Lee, Vincent, Smith, Kate, Gayler, Ross

arXiv.org Artificial IntelligenceSep-30-2010

This survey paper categorises, compares, and summarises from almost all published technical and review articles in automated fraud detection within the last 10 years. It defines the professional fraudster, formalises the main types and subtypes of known fraud, and presents the nature of data evidence collected within affected industries. Within the business context of mining the data to achieve higher cost savings, this research presents methods and techniques together with their problems. Compared to all related reviews on fraud detection, this survey covers much more technical articles and is the only one, to the best of our knowledge, which proposes alternative data and solutions from related domains.

data mining, evolutionary algorithm, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.chb.2012.01.002

1009.6119

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York (0.04)
North America > United States > Hawaii (0.04)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > New Finding (0.93)

Industry:

Law Enforcement & Public Safety > Fraud (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (0.93)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
Information Technology > Communications (1.00)
(10 more...)

Add feedback

Mantis: Predicting System Performance through Program Analysis and Modeling

Chun, Byung-Gon, Huang, Ling, Lee, Sangmin, Maniatis, Petros, Naik, Mayur

arXiv.org Artificial IntelligenceSep-30-2010

We present Mantis, a new framework that automatically predicts program performance with high accuracy. Mantis integrates techniques from programming language and machine learning for performance modeling, and is a radical departure from traditional approaches. Mantis extracts program features, which are information about program execution runs, through program instrumentation. It uses machine learning techniques to select features relevant to performance and creates prediction models as a function of the selected features. Through program analysis, it then generates compact code slices that compute these feature values for prediction. Our evaluation shows that Mantis can achieve more than 93% accuracy with less than 10% training data set, which is a significant improvement over models that are oblivious to program features. The system generates code slices that are cheap to compute feature values.

artificial intelligence, execution time, machine learning, (20 more...)

arXiv.org Artificial Intelligence

1010.0019

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Pair-Wise Cluster Analysis

Hardoon, David R., Pelcksman, Kristiaan

arXiv.org Machine LearningSep-18-2010

This paper studies the problem of learning clusters which are consistently present in different (continuously valued) representations of observed data. Our setup differs slightly from the standard approach of (co-) clustering as we use the fact that some form of `labeling' becomes available in this setup: a cluster is only interesting if it has a counterpart in the alternative representation. The contribution of this paper is twofold: (i) the problem setting is explored and an analysis in terms of the PAC-Bayesian theorem is presented, (ii) a practical kernel-based algorithm is derived exploiting the inherent relation to Canonical Correlation Analysis (CCA), as well as its extension to multiple views. A content based information retrieval (CBIR) case study is presented on the multi-lingual aligned Europal document dataset which supports the above findings.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Machine Learning

1009.3601

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Learning Latent Tree Graphical Models

Choi, Myung Jin, Tan, Vincent Y. F., Anandkumar, Animashree, Willsky, Alan S.

arXiv.org Machine LearningSep-14-2010

We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world datasets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups dataset.

algorithm, latent tree, node, (17 more...)

arXiv.org Machine Learning

1009.2722

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > New York (0.04)
(5 more...)

Genre: Research Report (0.81)

Industry:

Information Technology (0.45)
Leisure & Entertainment (0.45)
Health & Medicine > Pharmaceuticals & Biotechnology (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.88)

Add feedback

Calibrated Surrogate Losses for Classification with Label-Dependent Costs

Scott, Clayton

arXiv.org Machine LearningSep-14-2010

We present surrogate regret bounds for arbitrary surrogate losses in the context of binary classification with label-dependent costs. Such bounds relate a classifier's risk, assessed with respect to a surrogate loss, to its cost-sensitive classification risk. Two approaches to surrogate regret bounds are developed. The first is a direct generalization of Bartlett et al. [2006], who focus on margin-based losses and cost-insensitive classification, while the second adopts the framework of Steinwart [2007] based on calibration functions. Nontrivial surrogate regret bounds are shown to exist precisely when the surrogate loss satisfies a "calibration" condition that is easily verified for many common losses. We apply this theory to the class of uneven margin losses, and characterize when these losses are properly calibrated. The uneven hinge, squared error, exponential, and sigmoid losses are then treated in detail.

classification, margin loss, surrogate regret, (15 more...)

arXiv.org Machine Learning

1009.2718

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

Optimizing an Organized Modularity Measure for Topographic Graph Clustering: a Deterministic Annealing Approach

Rossi, Fabrice, Villa-Vialaneix, Nathalie

arXiv.org Machine LearningSep-7-2010

This paper proposes an organized generalization of Newman and Girvan's modularity measure for graph clustering. Optimized via a deterministic annealing scheme, this measure produces topologically ordered graph clusterings that lead to faithful and readable graph representations based on clustering induced graphs. Topographic graph clustering provides an alternative to more classical solutions in which a standard graph clustering method is applied to build a simpler graph that is then represented with a graph layout algorithm. A comparative study on four real world graphs ranging from 34 to 1 133 vertices shows the interest of the proposed approach with respect to classical solutions and to self-organizing maps for graphs.

artificial intelligence, graph, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1016/j.neucom.2009.11.023

1009.1318

Country: Europe > Austria (0.28)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Simple CW-SSIM Kernel-based Nearest Neighbor Method for Handwritten Digit Classification

Wang, Jiheng, Fan, Guangzhe, Wang, Zhou

arXiv.org Machine LearningSep-3-2010

We propose a simple kernel based nearest neighbor approach for handwritten digit classification. The "distance" here is actually a kernel defining the similarity between two images. We carefully study the effects of different number of neighbors and weight schemes and report the results. With only a few nearest neighbors (or most similar images) to vote, the test set error rate on MNIST database could reach about 1.5%-2.0%,

artificial intelligence, error rate, machine learning, (17 more...)

arXiv.org Machine Learning

1008.3951

Country: North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.47)

Industry: Education (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.82)

Add feedback

A PAC-Bayesian Analysis of Graph Clustering and Pairwise Clustering

Seldin, Yevgeny

arXiv.org Machine LearningSep-2-2010

We formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. This formulation enables practical and theoretical comparison of different approaches to graph clustering as well as comparison of graph clustering with other possible ways to model the graph. We adapt the PAC-Bayesian analysis of co-clustering (Seldin and Tishby, 2008; Seldin, 2009) to derive a PAC-Bayesian generalization bound for graph clustering. The bound shows that graph clustering should optimize a trade-off between empirical data fit and the mutual information that clusters preserve on the graph nodes. A similar trade-off derived from information-theoretic considerations was already shown to produce state-of-the-art results in practice (Slonim et al., 2005; Yom-Tov and Slonim, 2009). This paper supports the empirical evidence by providing a better theoretical foundation, suggesting formal generalization guarantees, and offering a more accurate way to deal with finite sample issues. We derive a bound minimization algorithm and show that it provides good results in real-life problems and that the derived PAC-Bayesian bound is reasonably tight.

artificial intelligence, graph, machine learning, (16 more...)

arXiv.org Machine Learning

1009.0499

Country: Europe > Germany (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback