AITopics

The problem of multilabel classification when the labels are related through a hierarchical categorization scheme occurs in many application domains such as computational biology. For example, this problem arises naturally when trying to automatically assign gene function using a controlled vocabularies like Gene Ontology. However, most existing approaches for predicting gene functions solve independent classification problems to predict genes that are involved in a given function category, independently of the rest. Here, we propose two simple methods for incorporating information about the hierarchical nature of the categorization scheme. In the first method, we use information about a gene's previous annotation to set an initial prior on its label. In a second approach, we extend a graph-based semi-supervised learning algorithm for predicting gene function in a hierarchy. We show that we can efficiently solve this problem by solving a linear system of equations. We compare these approaches with a previous label reconciliation-based approach. Results show that using the hierarchy information directly, compared to using reconciliation methods, improves gene function prediction.

bioinformatics, category, machine learning, (20 more...)

1205.2622

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)

Rendle, Steffen, Freudenthaler, Christoph, Gantner, Zeno, Schmidt-Thieme, Lars

BPR: Bayesian Personalized Ranking from Implicit Feedback

Item recommendation is the task of predicting a personalized ranking on a set of items (e.g. websites, movies, products). In this paper, we investigate the most common scenario with implicit feedback (e.g. clicks, purchases). There are many methods for item recommendation from implicit feedback like matrix factorization (MF) or adaptive knearest-neighbor (kNN). Even though these methods are designed for the item prediction task of personalized ranking, none of them is directly optimized for ranking. In this paper we present a generic optimization criterion BPR-Opt for personalized ranking that is the maximum posterior estimator derived from a Bayesian analysis of the problem. We also provide a generic learning algorithm for optimizing models with respect to BPR-Opt. The learning method is based on stochastic gradient descent with bootstrap sampling. We show how to apply our method to two state-of-the-art recommender models: matrix factorization and adaptive kNN. Our experiments indicate that for the task of personalized ranking our optimization method outperforms the standard learning techniques for MF and kNN. The results show the importance of optimizing models for the right criterion.

artificial intelligence, machine learning, matrix factorization, (13 more...)

1205.2618

Country: North America > United States (0.29)

Genre: Research Report > New Finding (0.48)

Industry: Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)

Zhang, Lingsong, Zhu, Zhengyuan

Spatial Multiresolution Cluster Detection Method

A novel multi-resolution cluster detection (MCD) method is proposed to identify irregularly shaped clusters in space. Multi-scale test statistic on a single cell is derived based on likelihood ratio statistic for Bernoulli sequence, Poisson sequence and Normal sequence. A neighborhood variability measure is defined to select the optimal test threshold. The MCD method is compared with single scale testing methods controlling for false discovery rate and the spatial scan statistics using simulation and f-MRI data. The MCD method is shown to be more effective for discovering irregularly shaped clusters, and the implementation of this method does not require heavy computation, making it suitable for cluster detection for large spatial data.

artificial intelligence, machine learning, mcd method, (18 more...)

1205.2106

Country: North America > United States (0.67)

Genre: Research Report (0.82)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.96)

Reduced Rank Vector Generalized Linear Models for Feature Extraction

She, Yiyuan

Supervised linear feature extraction can be achieved by fitting a reduced rank multivariate model. This paper studies rank penalized and rank constrained vector generalized linear models. From the perspective of thresholding rules, we build a framework for fitting singular value penalized models and use it for feature extraction. Through solving the rank constraint form of the problem, we propose progressive feature space reduction for fast computation in high dimensions with little performance loss. A novel projective cross-validation is proposed for parameter tuning in such nonconvex setups. Real data applications are given to show the power of the methodology in supervised dimension reduction and feature extraction.

artificial intelligence, data mining, machine learning, (18 more...)

1007.3098

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining > Feature Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Haury, Anne-Claire, Mordelet, Fantine, Vera-Licona, Paola, Vert, Jean-Philippe

TIGRESS: Trustful Inference of Gene REgulation using Stability Selection

arXiv.org Machine LearningMay-6-2012

Inferring the structure of gene regulatory networks (GRN) from gene expression data has many applications, from the elucidation of complex biological processes to the identification of potential drug targets. It is however a notoriously difficult problem, for which the many existing methods reach limited accuracy. In this paper, we formulate GRN inference as a sparse regression problem and investigate the performance of a popular feature selection method, least angle regression (LARS) combined with stability selection. We introduce a novel, robust and accurate scoring technique for stability selection, which improves the performance of feature selection with LARS. The resulting method, which we call TIGRESS (Trustful Inference of Gene REgulation using Stability Selection), was ranked among the top methods in the DREAM5 gene network reconstruction challenge. We investigate in depth the influence of the various parameters of the method and show that a fine parameter tuning can lead to significant improvements and state-of-the-art performance for GRN inference. TIGRESS reaches state-of-the-art performance on benchmark data. This study confirms the potential of feature selection techniques for GRN inference. Code and data are available on http://cbio.ensmp.fr/~ahaury. Running TIGRESS online is possible on GenePattern: http://www.broadinstitute.org/cancer/software/genepattern/.

artificial intelligence, machine learning, regulation, (17 more...)

1205.1181

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Bennett, Casey, Doub, Tom, Selove, Rebecca

EHRs Connect Research and Practice: Where Predictive Modeling, Artificial Intelligence, and Clinical Decision Support Intersect

arXiv.org Machine LearningApr-22-2012

Objectives: Electronic health records (EHRs) are only a first step in capturing and utilizing health-related data - the challenge is turning that data into useful information. Furthermore, EHRs are increasingly likely to include data relating to patient outcomes, functionality such as clinical decision support, and genetic information as well, and, as such, can be seen as repositories of increasingly valuable information about patients' health conditions and responses to treatment over time. Methods: We describe a case study of 423 patients treated by Centerstone within Tennessee and Indiana in which we utilized electronic health record data to generate predictive algorithms of individual patient treatment response. Multiple models were constructed using predictor variables derived from clinical, financial and geographic data. Results: For the 423 patients, 101 deteriorated, 223 improved and in 99 there was no change in clinical condition. Based on modeling of various clinical indicators at baseline, the highest accuracy in predicting individual patient response ranged from 70-72% within the models tested. In terms of individual predictors, the Centerstone Assessment of Recovery Level - Adult (CARLA) baseline score was most significant in predicting outcome over time (odds ratio 4.1 + 2.27). Other variables with consistently significant impact on outcome included payer, diagnostic category, location and provision of case management services. Conclusions: This approach represents a promising avenue toward reducing the current gap between research and practice across healthcare, developing data-driven clinical decision support based on real-world populations, and serving as a component of embedded clinical artificial intelligences that "learn" over time.

bioinformatics, information, machine learning, (16 more...)

doi: 10.1016/j.hlpt.2012.03.001

1204.4927

Country:

North America > United States > Tennessee (0.35)
North America > United States > Indiana (0.34)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
(2 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Biomedical Informatics > Clinical Informatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Allen, Genevera I., Peterson, Christine, Vannucci, Marina, Maletic-Savatic, Mirjana

Regularized Partial Least Squares with an Application to NMR Spectroscopy

arXiv.org Machine LearningApr-17-2012

Department of Statistics, Rice University Abstract High-dimensional data common in genomics, proteomics, and chemometrics often contains complicated correlation structures. Recently, partial least squares (PLS) and Sparse PLS methods have gained attention in these areas as dimension reduction techniques in the context of supervised data analysis. We introduce a framework for Regularized PLS by solving a relaxation of the SIMPLS optimization problem with penalties on the PLS loadings vectors. Our approach enjoys many advantages including flexibility, general penalties, easy interpretation of results, and fast computation in high-dimensional settings. We also outline extensions of our methods leading to novel methods for Nonnegative PLS and Generalized PLS, an adaption of PLS for structured data. We demonstrate the utility of our methods through simulations and a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data. To whom correspondence should be addressed; Department of Statistics, Rice University, MS 138, 6100 Main St., Houston, TX 77005 (email: gallen@rice.edu) 1 Introduction Technologies to measure high-throughput biomedical data in proteomics, chemometrics, and genomics have led to a proliferation of high-dimensional data that pose many statistical challenges. As genes, proteins, and metabolites, are biologically interconnected, the variables in these data sets are often highly correlated. In this context, several have recently advocated using partial least squares (PLS) for dimension reduction of supervised data, or data with a response or labels (Nguyen and Rocke, 2002b; Boulesteix and Strimmer, 2007; Rossouw et al., 2008; Chun and Keleş, 2010). First introduced by Wold (1966) as a regression method that uses least squares on a set of derived inputs accounting for multi-colinearities, others have since proposed alternative methods for PLS with multiple responses (de Jong, 1993) and for classification (Marx, 1996; Barker and Rayens, 2003).

artificial intelligence, loading, machine learning, (18 more...)

1204.3942

Country: North America > United States > Texas > Harris County > Houston (0.24)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

AAAI ConferencesMar-25-2012

Harnessing the Crowds for Automating the Identification of Web APIs

Pedrinaci, Carlos (The Open University) | Liu, Dong (The Open University) | Lin, Chenghua (The Open University) | Domingue, John (The Open University)

Supporting the efficient discovery and use of Web APIs is increasingly important as their use and popularity grows. Yet, a simple task like finding potentially interesting APIs and their related documentation turns out to be hard and time consuming even when using the best resources currently available on the Web. In this paper we describe our research towards an automated Web API documentation crawler and search engine. This paper presents two main contributions. First, we have devised and exploited crowdsourcing techniques to generate a curated dataset of Web APIs documentation. Second, thanks to this dataset, we have devised an engine able to automatically detect documentation pages. Our preliminary experiments have shown that we obtain an accuracy of 80% and a precision increase of 15 points over a keyword-based heuristic we have used as baseline.

api, dataset, documentation, (17 more...)

AAAI Conferences

2012 AAAI Spring Symposium Series

Country:

Oceania > New Zealand > North Island > Waikato (0.04)
North America > United States > District of Columbia > Washington (0.04)
North America > United States > California > Orange County > Irvine (0.04)
Europe > United Kingdom > England > Buckinghamshire > Milton Keynes (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Information Management > Search (0.88)
Information Technology > Communications > Web > Semantic Web (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

arXiv.org Machine LearningMar-15-2012

Modeling Multiple Annotator Expertise in the Semi-Supervised Learning Scenario

Yan, Yan, Rosales, Romer, Fung, Glenn, Dy, Jennifer

Learning algorithms normally assume that there is at most one annotation or label per data point. However, in some scenarios, such as medical diagnosis and on-line collaboration,multiple annotations may be available. In either case, obtaining labels for data points can be expensive and time-consuming (in some circumstances ground-truth may not exist). Semi-supervised learning approaches have shown that utilizing the unlabeled data is often beneficial in these cases. This paper presents a probabilistic semi-supervised model and algorithm that allows for learning from both unlabeled and labeled data in the presence of multiple annotators. We assume that it is known what annotator labeled which data points. The proposed approach produces annotator models that allow us to provide (1) estimates of the true label and (2) annotator variable expertise for both labeled and unlabeled data. We provide numerical comparisons under various scenarios and with respect to standard semi-supervised learning. Experiments showed that the presented approach provides clear advantages over multi-annotator methods that do not use the unlabeled data and over methods that do not use multi-labeler information.

annotator, artificial intelligence, machine learning, (17 more...)

1203.3529

Country: North America > United States > Wisconsin (0.14)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.93)
Health & Medicine > Diagnostic Medicine (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)

Yuan, null, Qi, null, Abdel-Gawad, Ahmed H., Minka, Thomas P.

Sparse-posterior Gaussian Processes for general likelihoods

arXiv.org Machine LearningMar-15-2012

Gaussian processes (GPs) provide a probabilistic nonparametric representation of functions in regression, classification, and other problems. Unfortunately, exact learning with GPs is intractable for large datasets. A variety of approximate GP methods have been proposed that essentially map the large dataset into a small set of basis points. Among them, two state-of-the-art methods are sparse pseudo-input Gaussian process (SPGP) (Snelson and Ghahramani, 2006) and variablesigma GP (VSGP) Walder et al. (2008), which generalizes SPGP and allows each basis point to have its own length scale. However, VSGP was only derived for regression. In this paper, we propose a new sparse GP framework that uses expectation propagation to directly approximate general GP likelihoods using a sparse and smooth basis. It includes both SPGP and VSGP for regression as special cases. Plus as an EP algorithm, it inherits the ability to process data online. As a particular choice of approximating family, we blur each basis point with a Gaussian distribution that has a full covariance matrix representing the data distribution around that basis point; as a result, we can summarize local data manifold information with a small set of basis points. Our experiments demonstrate that this framework outperforms previous GP classification methods on benchmark datasets in terms of minimizing divergence to the non-sparse GP solution as well as lower misclassification rate.

artificial intelligence, basis point, machine learning, (17 more...)

1203.3507

Country: North America > United States > Indiana > Tippecanoe County (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)