Performance Analysis
Online prediction of ovarian cancer
Zhdanov, Fedor, Vovk, Vladimir, Burford, Brian, Devetyarov, Dmitry, Nouretdinov, Ilia, Gammerman, Alex
In this paper we apply computer learning methods to diagnosing ovarian cancer using the level of the standard biomarker CA125 in conjunction with information provided by mass-spectrometry. We are working with a new data set collected over a period of 7 years. Using the level of CA125 and mass-spectrometry peaks, our algorithm gives probability predictions for the disease. To estimate classification accuracy we convert probability predictions into strict predictions. Our algorithm makes fewer errors than almost any linear combination of the CA125 level and one peak's intensity (taken on the log scale). To check the power of our algorithm we use it to test the hypothesis that CA125 and the peaks do not contain useful information for the prediction of the disease at a particular time before the diagnosis. Our algorithm produces $p$-values that are better than those produced by the algorithm that has been previously applied to this data set. Our conclusion is that the proposed algorithm is more reliable for prediction on new data.
Wikipedia-based Semantic Interpretation for Natural Language Processing
Gabrilovich, E., Markovitch, S.
Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
Identification of Pleonastic It Using the Web
Li, Y., Musilek, P., Reformat, M., Wyard-Scott, L.
In a significant minority of cases, certain pronouns, especially the pronoun it, can be used without referring to any specific entity. This phenomenon of pleonastic pronoun usage poses serious problems for systems aiming at even a shallow understanding of natural language texts. In this paper, a novel approach is proposed to identify such uses of it: the extrapositional cases are identified using a series of queries against the web, and the cleft cases are identified using a simple set of syntactic rules. The system is evaluated with four sets of news articles containing 679 extrapositional cases as well as 78 cleft constructs. The identification results are comparable to those obtained by human efforts.
Unsupervised Methods for Determining Object and Relation Synonyms on the Web
The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully-implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver , introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of resolver's probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% precision and 95% recall on a data set from the TREC corpus.
Efficiently Learning a Detection Cascade with Sparse Eigenvectors
Shen, Chunhua, Paisitkriangkrai, Sakrapee, Zhang, Jian
In this work, we first show that feature selection methods other than boosting can also be used for training an efficient object detector. In particular, we introduce Greedy Sparse Linear Discriminant Analysis (GSLDA) \cite{Moghaddam2007Fast} for its conceptual simplicity and computational efficiency; and slightly better detection performance is achieved compared with \cite{Viola2004Robust}. Moreover, we propose a new technique, termed Boosted Greedy Sparse Linear Discriminant Analysis (BGSLDA), to efficiently train a detection cascade. BGSLDA exploits the sample re-weighting property of boosting and the class-separability criterion of GSLDA.
Behavior Bounding: An Efficient Method for High-Level Behavior Comparison
In this paper, we explore methods for comparing agent behavior with human behavior to assist with validation. Our exploration begins by considering a simple method of behavior comparison. Motivated by shortcomings in this initial approach, we introduce behavior bounding, an automated model-based approach for comparing behavior that is inspired, in part, by Mitchell's Version Spaces. We show that behavior bounding can be used to compactly represent both human and agent behavior. We argue that relatively low amounts of human effort are required to build, maintain, and use the data structures that underlie behavior bounding, and we provide a theoretical basis for these arguments using notions of PAC Learnability. Next, we show empirical results indicating that this approach is effective at identifying differences in certain types of behaviors and that it performs well when compared against our initial benchmark methods. Finally, we demonstrate that behavior bounding can produce information that allows developers to identify and fix problems in an agent's behavior much more efficiently than standard debugging techniques.
The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Liu, Han, Lafferty, John, Wasserman, Larry
Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula--or "nonparanormal"--for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method's theoretical properties, and show that it works well in many examples.
Domain Adaptation: Learning Bounds and Algorithms
Mansour, Yishay, Mohri, Mehryar, Rostamizadeh, Afshin
This paper addresses the general problem of domain adaptation which arises in a variety of applications where the distribution of the labeled sample available somewhat differs from that of the test data. Building on previous work by Ben-David et al. (2007), we introduce a novel distance between distributions, discrepancy distance, that is tailored to adaptation problems with arbitrary loss functions. We give Rademacher complexity bounds for estimating the discrepancy distance from finite samples for different loss functions. Using this distance, we derive novel generalization bounds for domain adaptation for a wide family of loss functions. We also present a series of novel adaptation bounds for large classes of regularization-based algorithms, including support vector machines and kernel ridge regression based on the empirical discrepancy. This motivates our analysis of the problem of minimizing the empirical discrepancy for various loss functions for which we also give novel algorithms. We report the results of preliminary experiments that demonstrate the benefits of our discrepancy minimization algorithms for domain adaptation.
Sparse Causal Discovery in Multivariate Time Series
Haufe, Stefan, Nolte, Guido, Mueller, Klaus-Robert, Kraemer, Nicole
Our goal is to estimate causal interactions in multivariate time series. Using vector autoregressive (VAR) models, these can be defined based on non-vanishing coefficients belonging to respective time-lagged instances. As in most cases a parsimonious causality structure is assumed, a promising approach to causal discovery consists in fitting VAR models with an additional sparsity-promoting regularization. Along this line we here propose that sparsity should be enforced for the subgroups of coefficients that belong to each pair of time series, as the absence of a causal relation requires the coefficients for all time-lags to become jointly zero. Such behavior can be achieved by means of l1-l2-norm regularized regression, for which an efficient active set solver has been proposed recently. Our method is shown to outperform standard methods in recovering simulated causality graphs. The results are on par with a second novel approach which uses multiple statistical testing.
Information, Divergence and Risk for Binary Experiments
Reid, Mark D., Williamson, Robert C.
We unify f-divergences, Bregman divergences, surrogate loss bounds (regret bounds), proper scoring rules, matching losses, cost curves, ROC-curves and information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their primitives which all are related to cost-sensitive binary classification. As well as clarifying relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate loss bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants. It also suggests new techniques for estimating f-divergences.