Identifying Mislabeled Training Data
This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single algorithm, majority vote and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30 percent. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.
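The difference between the majority vote and consensus filters can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `flag_mislabeled` and the layout of `predictions` are assumptions, and the predictions are assumed to come from classifiers that did not see the instance during training (for example via cross-validation).

```python
from typing import Hashable, List, Sequence


def flag_mislabeled(
    labels: Sequence[Hashable],
    predictions: Sequence[Sequence[Hashable]],
    scheme: str = "majority",
) -> List[int]:
    """Return indices of instances that the filter treats as mislabeled.

    predictions[k][i] is the label that filter classifier k assigns to
    instance i (hypothetical layout chosen for this sketch).
    """
    n_classifiers = len(predictions)
    flagged = []
    for i, label in enumerate(labels):
        # Count how many filter classifiers disagree with the given label.
        errors = sum(1 for preds in predictions if preds[i] != label)
        if scheme == "consensus":
            is_noise = errors == n_classifiers      # all must disagree
        elif scheme == "majority":
            is_noise = errors > n_classifiers / 2   # more than half disagree
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if is_noise:
            flagged.append(i)
    return flagged


labels = ["a", "b", "a", "b"]
predictions = [
    ["a", "a", "b", "b"],
    ["a", "a", "b", "b"],
    ["a", "a", "a", "b"],
]
majority_flags = flag_mislabeled(labels, predictions, "majority")
consensus_flags = flag_mislabeled(labels, predictions, "consensus")
```

In this toy input the consensus filter flags a subset of what the majority filter flags, which matches the abstract's description of consensus filters as the more conservative choice.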
The Good Old Davis-Putnam Procedure Helps Counting Models
Birnbaum, E., Lozinskii, E. L.
As was shown recently, many important AI problems require counting the number of models of propositional formulas. The problem of counting models of such formulas is, according to present knowledge, computationally intractable in the worst case. Based on the Davis-Putnam procedure, we present an algorithm, CDP, that computes the exact number of models of a propositional CNF or DNF formula F. Let m and n be the number of clauses and variables of F, respectively, and let p denote the probability that a literal l of F occurs in a clause C of F; then the average running time of CDP is shown to be O(nm^d), where d=-1/log(1-p). The practical performance of CDP has been estimated in a series of experiments on a wide variety of CNF formulas.
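The idea of counting models by Davis-Putnam style splitting can be sketched as follows. This is an illustrative sketch for CNF input only, not the CDP algorithm as published: the function name `count_models`, the DIMACS-style integer literals, and the variable selection rule are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List


def count_models(clauses: Iterable[Iterable[int]], n_vars: int) -> int:
    """Count satisfying assignments of a CNF formula over n_vars variables.

    Each clause is a collection of nonzero ints; k means variable k is
    true, -k means it is false (assumed encoding for this sketch).
    """

    def assign(cls: List[FrozenSet[int]], lit: int) -> List[FrozenSet[int]]:
        # Make lit true: drop satisfied clauses, delete the opposite literal.
        out = []
        for c in cls:
            if lit in c:
                continue
            out.append(c - {-lit})
        return out

    def count(cls: List[FrozenSet[int]], free: int) -> int:
        if any(len(c) == 0 for c in cls):
            return 0              # an empty clause cannot be satisfied
        if not cls:
            return 2 ** free      # every remaining variable is unconstrained
        # A unit clause forces its literal; the opposite branch has no models.
        unit = next((next(iter(c)) for c in cls if len(c) == 1), None)
        if unit is not None:
            return count(assign(cls, unit), free - 1)
        # Otherwise split on a literal and add the counts of both branches.
        lit = next(iter(cls[0]))
        return count(assign(cls, lit), free - 1) + count(assign(cls, -lit), free - 1)

    return count([frozenset(c) for c in clauses], n_vars)


# (x1 or x2) and (not x1 or x3) over three variables
n_models = count_models([[1, 2], [-1, 3]], 3)
```

The key point is the base case: when no clauses remain, each unassigned variable doubles the count, so whole subtrees are counted without enumeration.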
An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email
This paper describes a novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users. The method is based on a combination of reinforcement learning and performance modeling of spoken dialogue systems. The reinforcement learning component applies Q-learning (Watkins, 1989), while the performance modeling component applies the PARADISE evaluation framework (Walker et al., 1997) to learn the performance function (reward) used in reinforcement learning. We illustrate the method with a spoken dialogue system named ELVIS (EmaiL Voice Interactive System), that supports access to email over the phone. We conduct a set of experiments for training an optimal dialogue strategy on a corpus of 219 dialogues in which human users interact with ELVIS over the phone. We then test that strategy on a corpus of 18 dialogues. We show that ELVIS can learn to optimize its strategy selection for agent initiative, for reading messages, and for summarizing email folders.
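For readers unfamiliar with Q-learning, the core update can be sketched as below. This is a generic tabular update, not the ELVIS training code: the function name `q_update`, the state and action labels, and the values of `alpha` and `gamma` are hypothetical, and the reward is assumed to be supplied as a number (in the paper it is derived from the learned performance function).

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Iterable, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(
    q: QTable,
    state: Hashable,
    action: Hashable,
    reward: float,
    next_state: Hashable,
    next_actions: Iterable[Hashable],
    alpha: float = 0.5,
    gamma: float = 0.9,
) -> float:
    """Apply one tabular Q-learning update and return the new estimate."""
    # Value of the best action available in the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q: QTable = defaultdict(float)
# Hypothetical labels: one dialogue state, one strategy choice, then the end.
first = q_update(q, "s0", "strategy_a", 1.0, "end", [])
second = q_update(q, "s0", "strategy_a", 1.0, "end", [])
```

Repeated updates with the same reward move the estimate toward that reward, which is how the strategy with the higher learned performance comes to be preferred.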
ProDiGe: PRioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples
Mordelet, Fantine, Vert, Jean-Philippe
Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Here we propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
Activity-Based Search for Black-Box Constraint-Programming Solvers
Michel, L., Van Hentenryck, P.
Robust search procedures are a central component in the design of black-box constraint-programming solvers. This paper proposes activity-based search, the idea of using the activity of variables during propagation to guide the search. Activity-based search was compared experimentally to impact-based search and the WDEG heuristics. Experimental results on a variety of benchmarks show that activity-based search is more robust than other heuristics and may produce significant improvements in performance.
Reasoning on Interval and Point-based Disjunctive Metric Constraints in Temporal Contexts
We introduce a temporal model for reasoning on disjunctive metric constraints on intervals and time points in temporal contexts. This temporal model is composed of a labeled temporal algebra and its reasoning algorithms. The labeled temporal algebra defines labeled disjunctive metric point-based constraints, where each disjunct in each input disjunctive constraint is univocally associated to a label. Reasoning algorithms manage labeled constraints, associated label lists, and sets of mutually inconsistent disjuncts. These algorithms guarantee consistency and obtain a minimal network. Additionally, constraints can be organized in a hierarchy of alternative temporal contexts. Therefore, we can reason on context-dependent disjunctive metric constraints on intervals and points. Moreover, the model is able to represent non-binary constraints, such that logical dependencies on disjuncts in constraints can be handled. The computational cost of reasoning algorithms is exponential in accordance with the underlying problem complexity, although some improvements are proposed.
Context models on sequences of covers
We present a class of models that, via a simple construction, enables exact, incremental, non-parametric, polynomial-time, Bayesian inference of conditional measures. The approach relies upon creating a sequence of covers on the conditioning variable and maintaining a different model for each set within a cover. Inference remains tractable by specifying the probabilistic model in terms of a random walk within the sequence of covers. We demonstrate the approach on problems of conditional density estimation; to our knowledge, it is the first closed-form, non-parametric Bayesian approach to this problem.
Complexity of and Algorithms for Borda Manipulation
Davies, Jessica, Katsirelos, George, Narodytska, Nina, Walsh, Toby
We prove that it is NP-hard for a coalition of two manipulators to compute how to manipulate the Borda voting rule. This resolves one of the last open problems in the computational complexity of manipulating common voting rules. Because of this NP-hardness, we treat computing a manipulation as an approximation problem where we try to minimize the number of manipulators. Based on ideas from bin packing and multiprocessor scheduling, we propose two new approximation methods to compute manipulations of the Borda rule. Experiments show that these methods significantly outperform the previous best known approximation method. We are able to find optimal manipulations in almost all the randomly generated elections tested. Our results suggest that, whilst computing a manipulation of the Borda rule by a coalition is NP-hard, computational complexity may provide only a weak barrier against manipulation in practice.
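To make the setting concrete, the sketch below computes Borda scores and checks whether a given set of manipulator ballots makes a chosen candidate win. It only verifies a proposed manipulation and does not implement the approximation methods from the paper. The function names are hypothetical, and requiring a strict win (no tie-breaking) is an assumption of this example.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence

Ballot = Sequence[Hashable]


def borda_scores(ballots: Sequence[Ballot]) -> Dict[Hashable, int]:
    """With m candidates, a ballot gives m-1 points to its first choice,
    m-2 to the second, and so on down to 0 for the last."""
    scores: Counter = Counter()
    for ballot in ballots:
        m = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] += m - 1 - position
    return dict(scores)


def manipulation_succeeds(
    honest: Sequence[Ballot],
    manipulators: Sequence[Ballot],
    favorite: Hashable,
) -> bool:
    """True if favorite strictly beats every other candidate (assumed rule)."""
    scores = borda_scores(list(honest) + list(manipulators))
    return all(scores[favorite] > s for c, s in scores.items() if c != favorite)


honest = [["b", "a", "c"], ["b", "c", "a"]]
one_manipulator = manipulation_succeeds(honest, [["a", "c", "b"]], "a")
two_manipulators = manipulation_succeeds(honest, [["a", "c", "b"]] * 2, "a")
```

Here one manipulator is not enough but two are, which illustrates the quantity the abstract treats as the optimization target: the number of manipulators needed.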
AntNet: Distributed Stigmergetic Control for Communications Networks
This paper introduces AntNet, a novel approach to the adaptive learning of routing tables in communications networks. AntNet is a distributed, mobile-agent-based Monte Carlo system that was inspired by recent work on the ant colony metaphor for solving optimization problems. AntNet's agents concurrently explore the network and exchange collected information. The communication among the agents is indirect and asynchronous, mediated by the network itself. This form of communication is typical of social insects and is called stigmergy. We compare our algorithm with six state-of-the-art routing algorithms coming from the telecommunications and machine learning fields. The algorithms' performance is evaluated over a set of realistic testbeds. We run many experiments over real and artificial IP datagram networks with an increasing number of nodes and under several paradigmatic spatial and temporal traffic distributions. Results are very encouraging. AntNet showed superior performance under all the experimental conditions with respect to its competitors. We analyze the main characteristics of the algorithm and try to explain the reasons for its superiority.
Learning to Order Things
Cohen, W. W., Schapire, R. E., Singer, Y.
There are many applications in which it is desirable to order rather than classify instances. Here we consider the problem of learning how to order instances given feedback in the form of preference judgments, i.e., statements to the effect that one instance should be ranked ahead of another. We outline a two-stage approach in which one first learns by conventional means a binary preference function indicating whether it is advisable to rank one instance before another. Here we consider an on-line algorithm for learning preference functions that is based on Freund and Schapire's 'Hedge' algorithm. In the second stage, new instances are ordered so as to maximize agreement with the learned preference function. We show that the problem of finding the ordering that agrees best with a learned preference function is NP-complete. Nevertheless, we describe simple greedy algorithms that are guaranteed to find a good approximation. Finally, we show how metasearch can be formulated as an ordering problem, and present experimental results on learning a combination of 'search experts', each of which is a domain-specific query expansion strategy for a web search engine.
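The second stage, ordering instances to agree with a learned preference function, can be sketched with a simple greedy procedure. This is an illustrative sketch under stated assumptions rather than a reproduction of the paper's algorithm: `greedy_order` and `pref` are hypothetical names, and `pref(u, v)` is assumed to return a number in [0, 1] expressing how strongly `u` should be ranked ahead of `v`.

```python
from typing import Callable, Hashable, Iterable, List

Pref = Callable[[Hashable, Hashable], float]


def greedy_order(items: Iterable[Hashable], pref: Pref) -> List[Hashable]:
    """Repeatedly emit the item whose outgoing preferences most exceed
    its incoming preferences among the items not yet ranked."""
    remaining = list(items)
    # Potential of v: preference for v over others minus others over v.
    potential = {
        v: sum(pref(v, u) - pref(u, v) for u in remaining if u != v)
        for v in remaining
    }
    ordering = []
    while remaining:
        best = max(remaining, key=lambda v: potential[v])
        ordering.append(best)
        remaining.remove(best)
        # Remove the contribution of the ranked item from the others.
        for v in remaining:
            potential[v] += pref(best, v) - pref(v, best)
    return ordering


# Toy preference function derived from hypothetical relevance scores.
score = {"a": 1, "b": 3, "c": 2}
ranked = greedy_order(["a", "b", "c"], lambda u, v: 1.0 if score[u] > score[v] else 0.0)
```

With a consistent preference function, as in this toy case, the greedy ordering recovers the underlying ranking; the approximation guarantee mentioned in the abstract matters when the learned preferences contain cycles.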