The National Airspace System (NAS) is a large and complex system with thousands of interrelated components: administration, control centers, airports, airlines, aircraft, passengers, etc. The complexity of the NAS creates many difficulties in management and control. One of the most pressing problems is flight delay. Delay imposes high costs on airlines, generates passenger complaints, and complicates airport operations. As demand on the system increases, the delay problem becomes more and more prominent. For this reason, it is essential for the Federal Aviation Administration to understand the causes of delay and to find ways to reduce it. Major contributing factors to delay are congestion at the origin airport, weather, increasing demand, and air traffic management (ATM) decisions such as Ground Delay Programs (GDPs). Delay is an inherently stochastic phenomenon. Even if all known causal factors could be accounted for, macro-level NAS delays could not be predicted with certainty from micro-level aircraft information. This paper presents a stochastic model that uses Bayesian Networks (BNs) to model the relationships among different components of aircraft delay and the causal factors that affect delays. A case study of delays for flights departing Chicago O'Hare International Airport (ORD) for Hartsfield-Jackson Atlanta International Airport (ATL) reveals how local and system-level environmental and human-caused factors combine to affect components of delay, and how these components contribute to the final arrival delay at the destination airport.
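To make the modeling idea concrete, here is a minimal sketch of the kind of discrete Bayesian network the paper describes, with a marginal delay probability computed by enumeration. The network structure, variable names, and probability values are illustrative assumptions, not the paper's estimates:

```python
# Minimal sketch of a discrete Bayesian network for delay propagation.
# Structure, variables, and probabilities are illustrative assumptions:
# Weather -> DepartureDelay <- GDP, and DepartureDelay -> ArrivalDelay.

# Prior probabilities P(Weather=bad) and P(GDP=active).
P_weather_bad = 0.2
P_gdp_active = 0.1

# P(DepartureDelay=1 | Weather, GDP), keyed by (weather_bad, gdp_active).
P_dep = {(0, 0): 0.10, (0, 1): 0.40, (1, 0): 0.50, (1, 1): 0.85}

# P(ArrivalDelay=1 | DepartureDelay).
P_arr = {0: 0.05, 1: 0.70}

def p_arrival_delay():
    """P(ArrivalDelay=1) by full enumeration over parent variables."""
    total = 0.0
    for w in (0, 1):
        pw = P_weather_bad if w else 1 - P_weather_bad
        for g in (0, 1):
            pg = P_gdp_active if g else 1 - P_gdp_active
            for d in (0, 1):
                pd = P_dep[(w, g)] if d else 1 - P_dep[(w, g)]
                total += pw * pg * pd * P_arr[d]
    return total

print(f"P(arrival delay) = {p_arrival_delay():.3f}")
```

Exact enumeration is fine for a toy network like this; a model over many delay components and causal factors would use a proper inference engine instead.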
Regulation of gene expression often involves proteins that bind to particular regions of DNA. Determining the binding sites for a protein and its specificity usually requires extensive biochemical and/or genetic experimentation. In this paper we illustrate the use of a neural network to obtain the desired information with much less experimental effort. It is often fairly easy to obtain a set of moderate length sequences, perhaps one or two hundred base-pairs, that each contain binding sites for the protein being studied. For example, the upstream regions of a set of genes that are all regulated by the same protein should each contain binding sites for that protein.
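As a rough illustration of the approach, the sketch below scans a DNA sequence with a single-layer network, scoring each fixed-width window as a candidate binding site. The window width, the example sequence, and the untrained random weights are assumptions for illustration; the paper's network would be trained on sequence sets like those described above:

```python
import numpy as np

BASES = "ACGT"
WIDTH = 6  # assumed binding-site width

def one_hot(window):
    """Encode a DNA window as a flat 4 x WIDTH binary vector."""
    x = np.zeros((WIDTH, 4))
    for i, base in enumerate(window):
        x[i, BASES.index(base)] = 1.0
    return x.ravel()

rng = np.random.default_rng(0)
weights = rng.normal(size=WIDTH * 4)  # stand-in for trained weights

def scan(seq):
    """Score every window; high scores mark putative binding sites."""
    return [(i, float(weights @ one_hot(seq[i:i + WIDTH])))
            for i in range(len(seq) - WIDTH + 1)]

scores = scan("ACGTGACCTTAGCA")
print(max(scores, key=lambda t: t[1]))  # best-scoring window position
```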
We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement over the tightest margin bound to date, due to Shawe-Taylor et al., and scales logarithmically in the inverse margin. Even when there are fewer training examples than input dimensions, sufficiently large margins lead to nontrivial bound values and, for maximum margins, to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend using SVMs on normalised feature vectors only, a recommendation that is well supported by our numerical experiments on two benchmark data sets.
1 Introduction
Linear classifiers are exceedingly popular in the machine learning community due to their straightforward applicability and high flexibility, which has recently been boosted by the so-called kernel methods. A natural and popular framework for the theoretical analysis of classifiers is the PAC (probably approximately correct) framework, which is closely related to Vapnik's work on the generalisation error. For binary classifiers it turned out that the growth function is an appropriate measure of "complexity" and can be tightly upper bounded by the VC (Vapnik-Chervonenkis) dimension.
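The paper's practical recommendation is easy to follow in code. The sketch below L2-normalises feature vectors, so they all have the same length, before fitting a linear SVM; the toy data and the use of scikit-learn are assumptions for illustration:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

# Toy, roughly linearly separable data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X.sum(axis=1) > 0).astype(int)

# Project each example onto the unit sphere, per the recommendation
# that feature vectors be of equal length, then fit a linear SVM.
X_unit = normalize(X)
clf = SVC(kernel="linear").fit(X_unit, y)
print(clf.score(X_unit, y))
```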
This paper analyzes consumer choices over lunchtime restaurants using data from a sample of several thousand anonymous mobile phone users in the San Francisco Bay Area. The data are used to identify each user's approximate typical morning location, as well as their choices of lunchtime restaurants. We build a model where restaurants have latent characteristics (whose distribution may depend on restaurant observables, such as star ratings, food category, and price range), each user has preferences for these latent characteristics, and these preferences are heterogeneous across users. Similarly, each restaurant has latent characteristics that describe users' willingness to travel to it, and each user has individual-specific preferences for those latent characteristics. Thus, both users' willingness to travel and their base utility for each restaurant vary across user-restaurant pairs. We use a Bayesian approach to estimation. To make the estimation computationally feasible, we rely on variational inference to approximate the posterior distribution, with stochastic gradient descent as the optimization method. Our model performs better than more standard competing models such as multinomial logit and nested logit models, in part due to the personalization of the estimates. We analyze how, after a restaurant closes, consumers re-allocate their demand between nearby restaurants and more distant restaurants with similar characteristics, and we compare our predictions to actual outcomes. Finally, we show how the model can be used to analyze counterfactual questions, such as what type of restaurant would attract the most consumers in a given location.
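The utility structure described here can be sketched directly: each user-restaurant pair gets a base utility from the inner product of latent vectors, minus a user-specific distance cost, and choice probabilities follow a softmax. The dimensions and randomly drawn parameters below are illustrative stand-ins for quantities the paper estimates with variational inference and SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_rest, k = 3, 5, 2

theta = rng.normal(size=(n_users, k))       # user tastes for latent traits
beta = rng.normal(size=(n_rest, k))         # restaurant latent traits
gamma = rng.gamma(2.0, 0.5, size=n_users)   # user-specific distance cost
dist = rng.uniform(0.5, 5.0, size=(n_users, n_rest))  # km, assumed

# u_ij: base utility varies by pair, as does willingness to travel.
utility = theta @ beta.T - gamma[:, None] * dist

# Softmax over restaurants gives each user's choice probabilities.
probs = np.exp(utility)
probs /= probs.sum(axis=1, keepdims=True)
print(probs.round(3))
```

Because both theta and gamma are user-specific, two users at the same location can rank the same restaurants differently, which is the personalization that the multinomial and nested logit baselines lack.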
Leslie Grate, Mark Herbster, Richard Hughey, and David Haussler, Baskin Center for Computer Engineering and Computer and Information Sciences, University of California, Santa Cruz, CA 95064; I. Saira Mian and Harry Noller, Sinsheimer Laboratories, University of California, Santa Cruz, CA 95064.
Keywords: RNA secondary structure, Gibbs sampler, Expectation Maximization, stochastic context-free grammars, hidden Markov models, tRNA, snRNA, 16S rRNA, linguistic methods
Abstract
A new method of discovering the common secondary structure of a family of homologous RNA sequences using Gibbs sampling and stochastic context-free grammars is proposed. Gibbs sampling is first used to estimate parameters that describe a statistical model of the family. After the Gibbs sampling has produced a crude statistical model for the family, this model is translated into a stochastic context-free grammar, which is then refined by an Expectation Maximization (EM) procedure to produce a more complete model. A prototype implementation of the method is tested on tRNA, pieces of 16S rRNA, and on U5 snRNA with good results.
Introduction
Tools for analyzing RNA are becoming increasingly important as in vitro evolution and selection techniques produce greater numbers of synthesized RNA families to supplement those related by phylogeny. Two principal methods have been established for predicting RNA secondary structure base pairings. The first is comparative analysis of phylogenetically related sequences; the second employs thermodynamics to compare the free energy changes predicted for formation of possible secondary structures, and relies on finding the structure with the lowest free energy (Tinoco Jr., Uhlenbeck, & Levine 1971; Turner, Sugimoto, & Freier 1988). When several related sequences are available that all share a common secondary structure, combinations of different approaches have been used to obtain improved results (Waterman 1989; Le & Zuker 1991; Han & Kim 1993; Chiu & Kolodziejczak 1991; Sankoff 1985; Winker et al. 1990; Lapedes 1992; Klinger & Brutlag 1993; Gutell et al. 1992). Recent efforts have applied Stochastic Context-Free Grammars (SCFGs) to the problems of statistical modeling, multiple alignment, discrimination, and prediction of the secondary structure of RNA families (Sakakibara et al. 1994; 1993; Eddy & Durbin 1994; Searls 1993).
This work was supported in part by NSF grants CDA-9115268 and IRI-9123692, and NIH grant GM17129.
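As a small illustration of the SCFG machinery this line of work builds on, the sketch below implements the inside algorithm, which computes the probability a stochastic context-free grammar assigns to a sequence; EM refinement reuses these inside (and companion outside) quantities. The toy grammar in Chomsky normal form is an assumption for illustration; real RNA grammars add rules that emit complementary base pairs jointly to capture secondary structure:

```python
from collections import defaultdict

# Toy SCFG in Chomsky normal form (illustrative assumption).
# Binary rules A -> B C and unary rules A -> terminal, with probabilities
# summing to 1 over each left-hand side.
binary = {("S", ("L", "S")): 0.5}
unary = {("S", b): 0.125 for b in "aucg"}
unary.update({("L", b): 0.25 for b in "aucg"})

def inside(seq, start="S"):
    """Inside algorithm: P(start symbol derives seq) under the SCFG."""
    n = len(seq)
    a = defaultdict(float)  # a[(i, j, A)] = P(A =>* seq[i..j])
    for i, base in enumerate(seq):          # spans of length 1
        for (A, t), p in unary.items():
            if t == base:
                a[(i, i, A)] += p
    for span in range(2, n + 1):            # longer spans, bottom up
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, (B, C)), p in binary.items():
                for k in range(i, j):       # split point between B and C
                    a[(i, j, A)] += p * a[(i, k, B)] * a[(k + 1, j, C)]
    return a[(0, n - 1, start)]

print(inside("acgu"))
```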