We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al.  and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to nontrivial bound values and - for maximum margins - to a vanishing complexity term.Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1 Introduction Linear classifiers are exceedingly popular in the machine learning community due to their straightforward applicability and high flexibility which has recently been boosted by the so-called kernel methods . A natural and popular framework for the theoretical analysis of classifiers is the PAC (probably approximately correct) framework which is closely related to Vapnik's work on the generalisation error . For binary classifiers it turned out that the growth function is an appropriate measureof "complexity" and can tightly be upper bounded by the VC (Vapnik-Chervonenkis) dimension .
Regulation of gene expression often involves proteins that bind to particular regions of DNA. Determining the binding sites for a protein and its specificity usually requires extensive biochemical and/or genetic experimentation. In this paper we illustrate the use of a neural network to obtain the desired information with much less experimental effort. It is often fairly easy to obtain a set of moderate length sequences, perhaps one or two hundred base-pairs, that each contain binding sites for the protein being studied. For example, the upstream regions of a set of genes that are all regulated by the same protein should each contain binding sites for that protein.
Work in the classification literature has shown that in computing a classification function, one need not know the class membership of all observations in the training set; the unlabeled observations still provide information on the marginal distribution of the feature set, and can thus contribute to increased classification accuracy for future observations. The present paper will show that this scheme can also be used for the estimation of class prior probabilities, which would be very useful in applications in which it is difficult or expensive to determine class membership. Both parametric and nonparametric estimators are developed. Asymptotic distributions of the estimators are derived, and it is proven that the use of the unlabeled observations does reduce asymptotic variance. This methodology is also extended to the estimation of subclass probabilities.
We propose a Bayesian nonparametric approach to the problem of jointly modeling multiple related time series. Our approach is based on the discovery of a set of latent, shared dynamical behaviors. Using a beta process prior, the size of the set and the sharing pattern are both inferred from data. We develop efficient Markov chain Monte Carlo methods based on the Indian buffet process representation of the predictive distribution of the beta process, without relying on a truncated model. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth and death proposals. We examine the benefits of our proposed feature-based model on several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.
The National Airspace System (NAS) is a large and complex system with thousands of interrelated components: administration, control centers, airports, airlines, aircraft, passengers, etc. The complexity of the NAS creates many difficulties in management and control. One of the most pressing problems is flight delay. Delay creates high cost to airlines, complaints from passengers, and difficulties for airport operations. As demand on the system increases, the delay problem becomes more and more prominent. For this reason, it is essential for the Federal Aviation Administration to understand the causes of delay and to find ways to reduce delay. Major contributing factors to delay are congestion at the origin airport, weather, increasing demand, and air traffic management (ATM) decisions such as the Ground Delay Programs (GDP). Delay is an inherently stochastic phenomenon. Even if all known causal factors could be accounted for, macro-level national airspace system (NAS) delays could not be predicted with certainty from micro-level aircraft information. This paper presents a stochastic model that uses Bayesian Networks (BNs) to model the relationships among different components of aircraft delay and the causal factors that affect delays. A case study on delays of departure flights from Chicago O'Hare international airport (ORD) to Hartsfield-Jackson Atlanta International Airport (ATL) reveals how local and system level environmental and human-caused factors combine to affect components of delay, and how these components contribute to the final arrival delay at the destination airport.