Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

Journal of Artificial Intelligence Research

This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets.
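
As a concrete illustration of the counting task, here is a minimal Python sketch of a sparse count tree in the spirit of the ADtree: counts of zero are never allocated, and conjunctive queries are answered by walking the tree. The class name ADNode, the attribute ordering, and the query interface are illustrative choices, and the paper's most-common-value and leaf-list refinements are omitted.

```python
from collections import defaultdict

class ADNode:
    """Sketch of an ADtree-style node: stores the count of records reaching it
    and, for every later attribute, one child per value that actually occurs,
    so zero counts are never allocated.  The most-common-value and leaf-list
    refinements described in the paper are omitted for brevity."""

    def __init__(self, records, attrs, start=0):
        self.count = len(records)
        # children[attr][value] -> ADNode built from the matching records
        self.children = {}
        for i in range(start, len(attrs)):
            attr = attrs[i]
            groups = defaultdict(list)
            for r in records:
                groups[r[attr]].append(r)
            self.children[attr] = {
                v: ADNode(subset, attrs, i + 1) for v, subset in groups.items()
            }

    def query(self, constraints):
        """Count records matching a conjunction such as {'a2': 1, 'a3': 1}.
        Assumes every constrained attribute is one of the tree's attributes."""
        if not constraints:
            return self.count
        # instantiate the earliest constrained attribute, then recurse
        attr = next(a for a in self.children if a in constraints)
        rest = {k: v for k, v in constraints.items() if k != attr}
        child = self.children[attr].get(constraints[attr])
        return child.query(rest) if child is not None else 0

# usage on a toy dataset of three records
records = [{'a1': 0, 'a2': 1, 'a3': 1},
           {'a1': 1, 'a2': 1, 'a3': 0},
           {'a1': 0, 'a2': 1, 'a3': 1}]
tree = ADNode(records, ['a1', 'a2', 'a3'])
print(tree.query({'a2': 1, 'a3': 1}))   # -> 2
```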


Tractability of Theory Patching

Journal of Artificial Intelligence Research

In this paper we consider the problem of `theory patching', in which we are given a domain theory, some of whose components are indicated to be possibly flawed, and a set of labeled training examples for the domain concept. The theory patching problem is to revise only the indicated components of the theory, such that the resulting theory correctly classifies all the training examples. Theory patching is thus a type of theory revision in which revisions are made to individual components of the theory. Our concern in this paper is to determine for which classes of logical domain theories the theory patching problem is tractable. We consider both propositional and first-order domain theories, and show that the theory patching problem is equivalent to that of determining what information contained in a theory is `stable' regardless of what revisions might be performed to the theory. We show that determining stability is tractable if the input theory satisfies two conditions: that revisions to each theory component have monotonic effects on the classification of examples, and that theory components act independently in the classification of examples in the theory. We also show how the concepts introduced can be used to determine the soundness and completeness of particular theory patching algorithms.
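
For readers who want a concrete (if drastically simplified) picture of the problem statement, the toy Python sketch below treats a theory as a conjunction of propositional components and searches over patches to the flagged components until all training examples classify correctly. Replacing a flagged component with a constant truth value is a simplification made here for brevity; it is not how the paper's patching operators work, and all names are illustrative.

```python
from itertools import product

def patch_theory(components, flawed, examples):
    """Toy sketch of theory patching: the theory is the conjunction of
    `components`, only the indices in `flawed` may be revised, and a 'patch'
    here simply fixes a flagged component to a constant truth value.  Returns
    the first patch under which every labeled example is classified
    correctly, or None if no such patch exists."""
    flawed = sorted(flawed)
    keep = [c for i, c in enumerate(components) if i not in flawed]
    for assignment in product([True, False], repeat=len(flawed)):
        def theory(x, patched=assignment):
            return all(c(x) for c in keep) and all(patched)
        if all(theory(x) == label for x, label in examples):
            return dict(zip(flawed, assignment))
    return None

# usage: component 1 (requiring feature 'b') is suspected to be flawed
components = [lambda x: x['a'], lambda x: x['b']]
examples = [({'a': True,  'b': False}, True),
            ({'a': False, 'b': False}, False)]
print(patch_theory(components, flawed={1}, examples=examples))   # {1: True}
```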


Monotonicity and Persistence in Preferential Logics

Journal of Artificial Intelligence Research

An important characteristic of many logics for Artificial Intelligence is their nonmonotonicity. This means that adding a formula to the premises can invalidate some of the consequences. There may, however, exist formulae that can always be safely added to the premises without destroying any of the consequences: we say they respect monotonicity. Also, there may be formulae that, when they are a consequence, can not be invalidated when adding any formula to the premises: we call them conservative. We study these two classes of formulae for preferential logics, and show that they are closely linked to the formulae whose truth-value is preserved along the (preferential) ordering. We will consider some preferential logics for illustration, and prove syntactic characterization results for them. The results in this paper may improve the efficiency of theorem provers for preferential logics.


Minimizing Statistical Bias with Queries

Neural Information Processing Systems

I describe a querying criterion that attempts to minimize the error of a learner by minimizing its estimated squared bias. I describe experiments with locally-weighted regression on two simple problems, and observe that this "bias-only" approach outperforms the more common "variance-only" exploration approach, even in the presence of noise.
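
A rough sketch of such a querying loop, assuming locally weighted linear regression as the learner: the functions lwr_predict and choose_query and the bias proxy used here (comparing a wide-bandwidth fit against a narrow-bandwidth reference fit) are illustrative stand-ins, not the paper's analytic bias estimator.

```python
import numpy as np

def lwr_predict(X, y, x0, bandwidth):
    """Locally weighted linear regression prediction at a single point x0."""
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)
    sw = np.sqrt(w)
    A = np.stack([np.ones_like(X), X], axis=1)          # design matrix [1, x]
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[0] + beta[1] * x0

def choose_query(X, y, candidates, bandwidth):
    """Pick the candidate query with the largest estimated squared bias.
    The gap between the learner's fit and a narrow-bandwidth (low-bias)
    reference fit is used as a crude proxy for squared bias."""
    def est_sq_bias(x0):
        smooth = lwr_predict(X, y, x0, bandwidth)
        reference = lwr_predict(X, y, x0, bandwidth / 4.0)
        return (smooth - reference) ** 2
    return max(candidates, key=est_sq_bias)

# usage on a noisy 1-D toy problem
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=30)
y = np.sin(3 * X) + 0.1 * rng.normal(size=30)
x_next = choose_query(X, y, candidates=np.linspace(-1, 1, 41), bandwidth=0.5)
print("next query:", x_next)
```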


488 Solutions to the XOR Problem

Neural Information Processing Systems

A globally convergent homotopy method is defined that is capable of sequentially producing large numbers of stationary points of the multi-layer perceptron mean-squared error surface. Using this algorithm, large subsets of the stationary points of two test problems are found. It is shown empirically that the MLP neural network appears to have an extreme ratio of saddle points compared to local minima, and that even small neural network problems have extremely large numbers of solutions.


Representation and Induction of Finite State Machines using Time-Delay Neural Networks

Neural Information Processing Systems

This work investigates the representational and inductive capabilities of time-delay neural networks (TDNNs) in general, and of two subclasses of TDNN: those with delays only on the inputs (IDNN), and those which also include delays on the hidden units (HDNN). Both architectures are capable of representing the same class of languages, the definite memory machine (DMM) languages, but the delays on the hidden units help the HDNN outperform the IDNN on problems composed of repeated features over short time windows.

1 Introduction In this paper we consider the representational and inductive capabilities of time-delay neural networks (TDNN) [Waibel et al., 1989] [Lang et al., 1990], also known as NNFIR [Wan, 1993]. A TDNN is a feed-forward network in which the set of inputs to any node i may include the outputs of previous layers not only at the current time step t, but also from d earlier time steps. The activation of node i at time t is therefore a function of a weighted sum over the current and delayed outputs of the preceding layers. TDNNs have been used in speech recognition [Waibel et al., 1989] and time series prediction [Wan, 1993]. In this paper we concentrate on the language induction problem.
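
A minimal sketch of the time-delay sum just described, assuming tanh activations and zero-padded history; the function name tdnn_layer and the weight layout (units_out, units_in, delays + 1) are illustrative choices.

```python
import numpy as np

def tdnn_layer(inputs, weights, bias, delays):
    """Minimal time-delay layer: the activation of unit i at time t is
    f(sum over source units j and delays k = 0..d of w[i, j, k] * x[j, t-k]),
    i.e. a weighted sum over the current and delayed outputs of the previous
    layer.  `inputs` has shape (units_in, T); history before t = 0 is zero."""
    units_in, T = inputs.shape
    units_out = weights.shape[0]
    padded = np.concatenate([np.zeros((units_in, delays)), inputs], axis=1)
    out = np.empty((units_out, T))
    for t in range(T):
        # window of the last (delays + 1) steps ending at t; column k = delay k
        window = padded[:, t : t + delays + 1][:, ::-1]
        out[:, t] = np.tanh(np.einsum('jk,ijk->i', window, weights) + bias)
    return out

# tiny usage example: 2 input lines, 3 hidden units, delays up to 2
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 10))
w = rng.normal(size=(3, 2, 3)) * 0.5      # (units_out, units_in, delays + 1)
h = tdnn_layer(x, w, bias=np.zeros(3), delays=2)
print(h.shape)                            # (3, 10)
```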


Neural Models for Part-Whole Hierarchies

Neural Information Processing Systems

We present a connectionist method for representing images that explicitly addresses their hierarchical nature. It blends data from neuroscience about whole-object viewpoint-sensitive cells in inferotemporal cortex [8] and attentional basis-field modulation in V4 [3] with ideas about hierarchical descriptions based on microfeatures [5, 11]. The resulting model makes critical use of bottom-up and top-down pathways for analysis and synthesis.


Adaptive On-line Learning in Changing Environments

Neural Information Processing Systems

An adaptive on-line algorithm extending the learning of learning idea is proposed and theoretically motivated. Relying only on gradient flow information, it can be applied to learning continuous functions or distributions, even when no explicit loss function is given and the Hessian is not available. Its efficiency is demonstrated for a non-stationary blind separation task of acoustic signals.

1 Introduction Neural networks provide powerful tools to capture the structure in data by learning. Often the batch learning paradigm is assumed, where the learner is given all training examples simultaneously and allowed to use them as often as desired. In large practical applications, however, batch learning is often infeasible, and on-line learning is employed instead.
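
The sketch below shows one generic way a learning rate can be adapted from gradient-flow information during on-line learning: the rate grows while successive gradients agree and shrinks when they disagree. It conveys the flavor of the approach only; the constants and the specific update rule are assumptions made here, not the paper's algorithm.

```python
import numpy as np

def online_adapt(grad_fn, w0, steps, eta0=0.05, alpha=0.02):
    """On-line gradient learning with a learning rate adapted from the gradient
    flow: eta grows when successive gradients point the same way (a slow drift
    of the environment) and shrinks when they disagree (noise or a nearby
    optimum).  A generic scheme for illustration, not the paper's update."""
    w, eta = np.array(w0, dtype=float), eta0
    prev_g = np.zeros_like(w)
    for t in range(steps):
        g = grad_fn(w, t)
        cos = g @ prev_g / (np.linalg.norm(g) * np.linalg.norm(prev_g) + 1e-12)
        eta = float(np.clip(eta * (1.0 + alpha * cos), 1e-4, 1.0))
        w -= eta * g
        prev_g = g
    return w, eta

# usage: track the slowly drifting minimum of a quadratic (changing environment)
target = lambda t: np.array([np.sin(0.01 * t), 0.0])
grad = lambda w, t: 2.0 * (w - target(t))
w_final, eta_final = online_adapt(grad, w0=[0.0, 0.0], steps=2000)
print(w_final, eta_final)
```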


Dual Kalman Filtering Methods for Nonlinear Prediction, Smoothing and Estimation

Neural Information Processing Systems

Prediction, estimation, and smoothing are fundamental to signal processing. To perform these interrelated tasks given noisy data, we form a time series model of the process that generates the data. Taking noise in the system explicitly into account, maximum-likelihood and Kalman frameworks are discussed which involve the dual process of estimating both the model parameters and the underlying state of the system. We review several established methods in the linear case, and propose several ...
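
For orientation, here is a standard linear Kalman filter for the state-estimation half of the problem; the dual scheme described in the abstract would interleave it with a second filter whose state vector holds the model parameters (indicated only in a comment below). All variable names and the toy random-walk example are illustrative.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, x0, P0):
    """Standard linear Kalman filter for
    x_t = A x_{t-1} + w_t,  y_t = C x_t + v_t,  w ~ N(0, Q), v ~ N(0, R).
    A dual estimation scheme would run a second, coupled filter whose
    'state' is the model parameters (A, C) themselves."""
    x, P = np.array(x0, float), np.array(P0, float)
    states = []
    for yt in y:
        # time update (prediction)
        x = A @ x
        P = A @ P @ A.T + Q
        # measurement update (correction)
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        x = x + K @ (yt - C @ x)
        P = (np.eye(len(x)) - K @ C) @ P
        states.append(x.copy())
    return np.array(states)

# usage: noisy observations of a scalar random walk
rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(scale=0.1, size=100))
obs = truth + rng.normal(scale=0.5, size=100)
est = kalman_filter(obs.reshape(-1, 1), A=np.eye(1), C=np.eye(1),
                    Q=0.01 * np.eye(1), R=0.25 * np.eye(1),
                    x0=np.zeros(1), P0=np.eye(1))
print(est[-1], truth[-1])
```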


Probabilistic Interpretation of Population Codes

Neural Information Processing Systems

We present a theoretical framework for population codes which generalizes naturally to the important case where the population provides information about a whole probability distribution over an underlying quantity rather than just a single value. We use the framework to analyze two existing models, and to suggest and evaluate a third model for encoding such probability distributions.
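
A textbook instance of reading a whole distribution out of a population code, assuming Gaussian tuning curves, independent Poisson spike counts, and a flat prior; this is a generic construction for illustration, not the specific models analyzed in the paper.

```python
import numpy as np

def decode_population(counts, preferred, x_grid, gain=20.0, width=0.3):
    """Bayesian decoding of a population code: Gaussian tuning curves,
    independent Poisson spike counts, flat prior.  Returns a normalized
    posterior over candidate stimulus values, i.e. a whole distribution
    rather than a single point estimate."""
    # tuning[i, j] = mean count of neuron i for stimulus x_grid[j]
    tuning = gain * np.exp(
        -0.5 * ((x_grid[None, :] - preferred[:, None]) / width) ** 2)
    # Poisson log-likelihood up to terms that do not depend on the stimulus
    log_post = counts @ np.log(tuning + 1e-12) - tuning.sum(axis=0)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# usage: 20 neurons with preferred values spread over [-2, 2]
rng = np.random.default_rng(0)
preferred = np.linspace(-2, 2, 20)
x_grid = np.linspace(-2, 2, 201)
true_x = 0.5
rates = 20.0 * np.exp(-0.5 * ((true_x - preferred) / 0.3) ** 2)
counts = rng.poisson(rates)
posterior = decode_population(counts, preferred, x_grid)
print(x_grid[posterior.argmax()])        # close to 0.5
```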