AITopics

doi: 10.5121/ijaia.2010.1409

1011.0628

Country:

North America > United States (0.14)
Asia > India > Kerala (0.05)
Europe > Slovenia (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education > Focused Education > Special Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.93)

arXiv.org Machine Learning, Oct-30-2010

Concentration inequalities of the cross-validation estimator for Empirical Risk Minimiser

Cornec, Matthieu

In this article, we derive concentration inequalities for the cross-validation estimate of the generalization error for empirical risk minimizers. In the general setting, we prove sanity-check bounds in the spirit of Kearns et al. (1999): "bounds showing that the worst-case error of this estimate is not much worse than that of the training error estimate". General loss functions and classes of predictors with finite VC-dimension are considered. We closely follow the formalism introduced by Dudoit et al. (2003) to cover a large variety of cross-validation procedures, including leave-one-out cross-validation, k-fold cross-validation, holdout cross-validation (or split sample), and leave-υ-out cross-validation. In particular, we focus on proving the consistency of the various cross-validation procedures. We point out the interest of each cross-validation procedure in terms of rate of convergence. An estimation curve with transition phases, depending on the cross-validation procedure and not only on the percentage of observations in the test sample, gives a simple rule for choosing the cross-validation procedure. An interesting consequence is that the size of the test sample is not required to grow to infinity for the cross-validation procedure to be consistent.
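The procedures named in this abstract differ only in how they split the sample into training and test indices. The following is a minimal sketch of those splits, not the paper's formalism. The function names, the toy data, and the choice of a sample mean under squared loss as the empirical risk minimiser are all assumptions made for illustration.

```python
from itertools import combinations


def holdout_split(n, n_test):
    """One split: the last n_test indices form the test sample."""
    idx = list(range(n))
    return [(idx[:n - n_test], idx[n - n_test:])]


def k_fold_splits(n, k):
    """k disjoint test blocks of near-equal size; each index is tested once."""
    idx = list(range(n))
    base, extra = divmod(n, k)
    splits, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        splits.append((train, test))
        start += size
    return splits


def leave_v_out_splits(n, v):
    """Every subset of v indices serves once as the test sample."""
    everything = set(range(n))
    return [(sorted(everything - set(test)), list(test))
            for test in combinations(range(n), v)]


def leave_one_out_splits(n):
    """Leave-one-out is leave-v-out with v = 1."""
    return leave_v_out_splits(n, 1)


def cv_risk(data, splits, fit, loss):
    """Average, over splits, of the mean test loss of the predictor fit on train."""
    total = 0.0
    for train, test in splits:
        predictor = fit([data[i] for i in train])
        total += sum(loss(predictor, data[i]) for i in test) / len(test)
    return total / len(splits)


# Toy empirical risk minimiser (an assumption for illustration):
# the constant that minimises squared loss on a sample is its mean.
def fit_mean(sample):
    return sum(sample) / len(sample)


def squared_loss(prediction, y):
    return (prediction - y) ** 2


data = [1.0, 2.0, 3.0, 4.0]
loo_risk = cv_risk(data, leave_one_out_splits(len(data)), fit_mean, squared_loss)
two_fold_risk = cv_risk(data, k_fold_splits(len(data), 2), fit_mean, squared_loss)
```

Note that leave-υ-out enumerates every subset of size υ, so the number of splits grows combinatorially with the sample size; the sketch is only practical for small samples.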

artificial intelligence, machine learning, test sample

1011.0096

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (1.00)

Kloft, Marius, Brefeld, Ulf, Sonnenburg, Soeren, Zien, Alexander

Non-Sparse Regularization for Multiple Kernel Learning

arXiv.org Machine Learning, Oct-26-2010

Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this 1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary norms. We provide new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, like p-norms with p>1. Empirically, we demonstrate that the interleaved optimization strategies are much faster than the commonly used wrapper approaches. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and $\ell_\infty$-norm MKL in various scenarios. Empirical applications of p-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.
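To make the idea of a p-norm kernel mixture concrete, here is a small sketch of a weighted kernel combination whose weight vector is scaled to unit p-norm. It assumes the common formulation in which non-negative weights are constrained by their p-norm; it does not implement the interleaved optimization strategies from the abstract, and all function names and the toy kernel matrices are hypothetical.

```python
def p_norm(theta, p):
    """The p-norm of a weight vector, for p >= 1."""
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)


def normalize_to_p_norm(theta, p):
    """Rescale the weights so that their p-norm equals 1."""
    norm = p_norm(theta, p)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero weight vector")
    return [t / norm for t in theta]


def combine_kernels(kernels, theta):
    """Weighted sum of kernel matrices: K = sum_m theta_m * K_m."""
    n = len(kernels[0])
    return [[sum(t * K[i][j] for t, K in zip(theta, kernels))
             for j in range(n)]
            for i in range(n)]


# Two toy 2x2 kernel matrices (assumed data, for illustration only).
K_identity = [[1.0, 0.0], [0.0, 1.0]]
K_ones = [[1.0, 1.0], [1.0, 1.0]]

raw_weights = [3.0, 4.0]
theta_1 = normalize_to_p_norm(raw_weights, 1)  # weights sum to 1
theta_2 = normalize_to_p_norm(raw_weights, 2)  # weights have unit Euclidean norm
K_mix = combine_kernels([K_identity, K_ones], theta_2)
```

The sketch only shows how the norm constraint rescales a given weight vector; in MKL the weights are learned jointly with the classifier.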

artificial intelligence, data mining, machine learning

1003.0079

Country:

North America > United States (1.00)
Europe > Germany > Baden-Württemberg (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)

El-Yaniv, Ran, Nisenson, Mordechai

On the Foundations of Adversarial Single-Class Classification

arXiv.org Artificial Intelligence, Oct-21-2010

Motivated by authentication, intrusion and spam detection applications, we consider single-class classification (SCC) as a two-person game between the learner and an adversary. In this game the learner has a sample from a target distribution, and the goal is to construct a classifier capable of distinguishing observations from the target distribution from observations emitted by an unknown other distribution. The ideal SCC classifier must guarantee a given tolerance for the false-positive error (false alarm rate) while minimizing the false-negative error (intruder pass rate). Viewing SCC as a two-person zero-sum game, we identify both deterministic and randomized optimal classification strategies for different game variants. We demonstrate that randomized classification can provide a significant advantage. In the deterministic setting we show how to reduce SCC to two-class classification, where in the two-class problem the other class is a synthetically generated distribution. We provide an efficient and practical algorithm for constructing and solving the two-class problem. The algorithm distinguishes low-density regions of the target distribution and is shown to be consistent.

artificial intelligence, data mining, machine learning

1010.4466

Country:

Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)
North America > United States > New York (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment > Games (1.00)
Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

El-Yaniv, Ran, Etzion-Rosenberg, Noam

Hierarchical Multiclass Decompositions with Application to Authorship Determination

arXiv.org Artificial Intelligence, Oct-11-2010

This paper is mainly concerned with the question of how to decompose multiclass classification problems into binary subproblems. We extend known Jensen-Shannon bounds on the Bayes risk of binary problems to hierarchical multiclass problems and use these bounds to develop a heuristic procedure for constructing hierarchical multiclass decompositions for multinomials. We test our method and compare it to the well-known "all-pairs" decomposition. Our tests are performed using a new authorship determination benchmark test of machine learning authors. The new method consistently outperforms the all-pairs decomposition when the number of classes is small and breaks even on larger multiclass problems. With both methods, the classification accuracy we achieve, using an SVM over a feature set consisting of both high-frequency single tokens and high-frequency token pairs, appears to be exceptionally high compared to known results in authorship determination.
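The quantity behind the bounds mentioned above is the Jensen-Shannon divergence between class-conditional distributions. As a minimal sketch, the following computes it for two multinomials, assuming equal class priors and base-2 logarithms. The paper's actual bounds and its decomposition heuristic are not reproduced here, and the function names are illustrative.

```python
from math import log2


def entropy(p):
    """Shannon entropy in bits, with the convention 0 * log(0) = 0."""
    return -sum(x * log2(x) for x in p if x > 0)


def jensen_shannon(p, q):
    """JS divergence with equal weights: H((p + q) / 2) - (H(p) + H(q)) / 2."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mixture) - (entropy(p) + entropy(q)) / 2


# Disjoint supports give the maximum of 1 bit; identical distributions give 0.
js_disjoint = jensen_shannon([1.0, 0.0], [0.0, 1.0])
js_identical = jensen_shannon([0.5, 0.5], [0.5, 0.5])
```

Intuitively, two classes whose token distributions have a large divergence are easy to tell apart, which is why such a quantity can guide how classes are grouped into binary subproblems.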

artificial intelligence, decomposition, machine learning

1010.2102

Country:

Asia > Middle East > Israel (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Ahdesmäki, Miika, Strimmer, Korbinian

Feature selection in omics prediction problems using cat scores and false nondiscovery rate control

arXiv.org Machine Learning, Oct-8-2010

We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted $t$-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James-Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package "sda", available from the R repository CRAN.

artificial intelligence, machine learning, selection

doi: 10.1214/09-AOAS277

0903.2003

Country:

Europe > Germany (0.15)
Europe > Finland (0.14)

Genre: Research Report > Experimental Study (0.47)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.96)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Mordelet, Fantine, Vert, Jean-Philippe

A bagging SVM to learn from positive and unlabeled examples

arXiv.org Machine Learning, Oct-5-2010

In many applications, such as information retrieval or gene ranking, one is given a finite set of data of interest sharing a particular property, and wishes to find other data sharing the same property. In information retrieval, for example, the finite set can be a user query, or a set of documents known to belong to a specific category, and the goal is to scan a large database of documents to identify new documents related to the query or belonging to the same category. In gene ranking, the query is a finite list of genes known to have a given function or to be associated with a given disease, and the goal is to identify new genes sharing the same property (Aerts et al., 2006). In fact, this setting is ubiquitous in applications where identifying data of interest is difficult or expensive, e.g., because human intervention is necessary or expensive experiments are needed, while unlabeled data can be easily collected. In such cases there is a clear opportunity to alleviate the burden and cost of identifying interesting data with the help of machine learning techniques. More formally, let us assign a binary label to each possible data point: positive (+1) for data of interest, negative (−1) for other data. Unlabeled data are data for which we do not know whether they are interesting or not. Denoting X the set of data, we assume that the "query" is a finite set of data P {x

artificial intelligence, classifier, machine learning

1010.0772

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.47)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

arXiv.org Artificial Intelligence, Sep-30-2010

A Comprehensive Survey of Data Mining-based Fraud Detection Research

Phua, Clifton, Lee, Vincent, Smith, Kate, Gayler, Ross

This survey paper categorises, compares, and summarises almost all published technical and review articles in automated fraud detection within the last 10 years. It defines the professional fraudster, formalises the main types and subtypes of known fraud, and presents the nature of data evidence collected within affected industries. Within the business context of mining the data to achieve higher cost savings, this research presents methods and techniques together with their problems. Compared to all related reviews on fraud detection, this survey covers many more technical articles and is the only one, to the best of our knowledge, which proposes alternative data and solutions from related domains.

data mining, evolutionary algorithm, machine learning

doi: 10.1016/j.chb.2012.01.002

1009.6119

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York (0.04)
North America > United States > Hawaii (0.04)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > New Finding (0.93)

Industry:

Law Enforcement & Public Safety > Fraud (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (0.93)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
Information Technology > Communications (1.00)

Journal of Artificial Intelligence Research, Aug-26-2010

Cause Identification from Aviation Safety Incident Reports via Weakly Supervised Semantic Lexicon Construction

Abedin, M. A., Ng, V., Khan, L.

The Aviation Safety Reporting System collects voluntarily submitted reports on aviation safety incidents to facilitate research work aiming to reduce such incidents. To effectively reduce these incidents, it is vital to accurately identify why these incidents occurred. More precisely, given a set of possible causes, or shaping factors, this task of cause identification involves identifying all and only those shaping factors that are responsible for the incidents described in a report. We investigate two approaches to cause identification. Both approaches exploit information provided by a semantic lexicon, which is automatically constructed via Thelen and Riloff's Basilisk framework augmented with our linguistic and algorithmic modifications. The first approach labels a report using a simple heuristic, which looks for the words and phrases acquired during the semantic lexicon learning process in the report. The second approach recasts cause identification as a text classification problem, employing supervised and transductive text classification algorithms to learn models from incident reports labeled with shaping factors and using the models to label unseen reports. Our experiments show that both the heuristic-based approach and the learning-based approach (when given sufficient training data) outperform the baseline system significantly.

lexicon, semantic lexicon, word and phrase


doi: 10.1613/jair.2986

AI Access Foundation

10662


Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Illinois > Lake County > Waukegan (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Industry:

Transportation > Air (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Islam, Md. Saiful, Khaled, Shah Mostafa, Farhan, Khalid, Rahman, Md. Abdur, Rahman, Joy

Modeling Spammer Behavior: Naïve Bayes vs. Artificial Neural Networks

arXiv.org Artificial Intelligence, Aug-19-2010

Addressing the problem of spam emails on the Internet, this paper presents a comparative study of Naïve Bayes and Artificial Neural Network (ANN) based modeling of spammer behavior. Keyword-based spam email filtering techniques fall short in modeling spammer behavior, as the spammer constantly changes tactics to circumvent these filters. The evasive tactics that the spammer uses are themselves patterns that can be modeled to combat spam. It has been observed that both Naïve Bayes and ANN are well suited to modeling common spammer patterns. Experimental results demonstrate that both of them achieve a promising detection rate of around 92%, which is a considerable improvement in performance over contemporary keyword-based filtering approaches.

artificial intelligence, machine learning, spam filtering

1008.3282

Country:

Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
Oceania > New Zealand > North Island > Waikato (0.04)
Oceania > Australia > Queensland > Brisbane (0.04)

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Security & Privacy > Spam Filtering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)