Genre
Obtaining Calibrated Probabilities from Boosting
Niculescu-Mizil, Alexandru, Caruana, Richard A.
Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by
Toward Practical N2 Monte Carlo: the Marginal Particle Filter
Klaas, Mike, de Freitas, Nando, Doucet, Arnaud
Sequential Monte Carlo techniques are useful for state estimation in non-linear, non-Gaussian dynamic models. These methods allow us to approximate the joint posterior distribution using sequential importance sampling. In this framework, the dimension of the target distribution grows with each time step, thus it is necessary to introduce some resampling steps to ensure that the estimates provided by the algorithm have a reasonable variance. In many applications, we are only interested in the marginal filtering distribution which is defined on a space of fixed dimension. We present a Sequential Monte Carlo algorithm called the Marginal Particle Filter which operates directly on the marginal distribution, hence avoiding having to perform importance sampling on a space of growing dimension. Using this idea, we also derive an improved version of the auxiliary particle filter. We show theoretic and empirical results which demonstrate a reduction in variance over conventional particle filtering, and present techniques for reducing the cost of the marginal particle filter with N particles from O(N2) to O(N logN).
Learning about individuals from group statistics
Kuck, Hendrik, de Freitas, Nando
We propose a new problem formulation which is similar to, but more informative than, the binary multiple-instance learning problem. In this setting, we are given groups of instances (described by feature vectors) along with estimates of the fraction of positively-labeled instances per group. The task is to learn an instance level classifier from this information. That is, we are trying to estimate the unknown binary labels of individuals from knowledge of group statistics. We propose a principled probabilistic model to solve this problem that accounts for uncertainty in the parameters and in the unknown individual labels. This model is trained with an efficient MCMC algorithm. Its performance is demonstrated on both synthetic and real-world data arising in general object recognition.
Learning Bayesian Network Parameters with Prior Knowledge about Context-Specific Qualitative Influences
Feelders, Ad, van der Gaag, Linda C.
We present a method for learning the parameters of a Bayesian network with prior knowledge about the signs of influences between variables. Our method accommodates not just the standard signs, but provides for context-specific signs as well. We show how the various signs translate into order constraints on the network parameters and how isotonic regression can be used to compute order-constrained estimates from the available data. Our experimental results show that taking prior knowledge about the signs of influences into account leads to an improved fit of the true distribution, especially when only a small sample of data is available. Moreover, the computed estimates are guaranteed to be consistent with the specified signs, thereby resulting in a network that is more likely to be accepted by experts in its domain of application.
Maximum Margin Bayesian Networks
Guo, Yuhong, Wilkinson, Dana, Schuurmans, Dale
We consider the problem of learning Bayesian network classifiers that maximize the margin over a set of classification variables. We find that this problem is harder for Bayesian networks than for undirected graphical models like maximum margin Markov networks. The main difficulty is that the parameters in a Bayesian network must satisfy additional normalization constraints that an undirected graphical model need not respect. These additional constraints complicate the optimization task. Nevertheless, we derive an effective training algorithm that solves the maximum margin training problem for a range of Bayesian network topologies, and converges to an approximate solution for arbitrary network topologies. Experimental results show that the method can demonstrate improved generalization performance over Markov networks when the directed graphical structure encodes relevant knowledge. In practice, the training technique allows one to combine prior knowledge expressed as a directed (causal) model with state of the art discriminative learning methods.
Bayes Blocks: An Implementation of the Variational Bayesian Building Blocks Framework
Harva, Markus, Raiko, Tapani, Honkela, Antti, Valpola, Harri, Karhunen, Juha
A software library for constructing and learning probabilistic models is presented. The library offers a set of building blocks from which a large variety of static and dynamic models can be built. These include hierarchical models for variances of other variables and many nonlinear models. The underlying variational Bayesian machinery, providing for fast and robust estimation but being mathematically rather involved, is almost completely hidden from the user thus making it very easy to use the library. The building blocks include Gaussian, rectified Gaussian and mixture-of-Gaussians variables and computational nodes which can be combined rather freely.
On the Detection of Concept Changes in Time-Varying Data Stream by Testing Exchangeability
Ho, Shen-Shyang, Wechsler, Harry
A martingale framework for concept change detection based on testing data exchangeability was recently proposed (Ho, 2005). In this paper, we describe the proposed change-detection test based on the Doob's Maximal Inequality and show that it is an approximation of the sequential probability ratio test (SPRT). The relationship between the threshold value used in the proposed test and its size and power is deduced from the approximation. The mean delay time before a change is detected is estimated using the average sample number of a SPRT. The performance of the test using various threshold values is examined on five different data stream scenarios simulated using two synthetic data sets. Finally, experimental results show that the test is effective in detecting changes in time-varying data streams simulated using three benchmark data sets.
Learning Factor Graphs in Polynomial Time & Sample Complexity
Abbeel, Pieter, Koller, Daphne, Ng, Andrew Y.
We study computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded factor size and bounded connectivity can be learned in polynomial time and polynomial number of samples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Unlike maximum likelihood estimation, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks.
Learning from Sparse Data by Exploiting Monotonicity Constraints
Altendorf, Eric E., Restificar, Angelo C., Dietterich, Thomas G.
When training data is sparse, more domain knowledge must be incorporated into the learning algorithm in order to reduce the effective size of the hypothesis space. This paper builds on previous work in which knowledge about qualitative monotonicities was formally represented and incorporated into learning algorithms (e.g., Clark & Matwin's work with the CN2 rule learning algorithm). We show how to interpret knowledge of qualitative influences, and in particular of monotonicities, as constraints on probability distributions, and to incorporate this knowledge into Bayesian network learning algorithms. We show that this yields improved accuracy, particularly with very small training sets (e.g. less than 10 examples).
Unsupervised spectral learning
Shortreed, Susan, Meila, Marina
In spectral clustering and spectral image segmentation, the data is partioned starting from a given matrix of pairwise similarities S. the matrix S is constructed by hand, or learned on a separate training set. In this paper we show how to achieve spectral clustering in unsupervised mode. Our algorithm starts with a set of observed pairwise features, which are possible components of an unknown, parametric similarity function. This function is learned iteratively, at the same time as the clustering of the data. The algorithm shows promosing results on synthetic and real data.