Caruana, Rich
Using Multiple Samples to Learn Mixture Models
Lee, Jason D., Gilad-Bachrach, Ran, Caruana, Rich
In the mixture models problem it is assumed that there are $K$ distributions $\theta_{1},\ldots,\theta_{K}$ and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same $K$ underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of the art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data.
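To make the core idea concrete, here is a minimal numpy sketch (a toy construction of my own, not the paper's algorithm) for two spherical Gaussian components observed through two samples with different mixing weights: the difference of the samples' empirical means is parallel to the difference of the component means, which is enough to separate the components by projection. All names and parameter values below are illustrative assumptions.

    # Toy sketch (illustrative, not the paper's algorithm): two samples from the
    # same two Gaussian components, mixed with different weights w_a and w_b.
    # The difference of the empirical means is parallel to mu1 - mu2, so a
    # projection onto it separates the components without knowing the weights.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 50                                  # dimensionality
    mu1, mu2 = np.zeros(d), np.ones(d)      # hidden component means (assumed)
    w_a, w_b = 0.8, 0.3                     # different mixing weights per sample

    def draw(n, w):
        z = rng.random(n) < w               # latent component indicator
        x = np.where(z[:, None], mu1, mu2) + rng.normal(size=(n, d))
        return x, z

    xa, za = draw(5000, w_a)
    xb, zb = draw(5000, w_b)

    # E[xa] - E[xb] = (w_a - w_b) * (mu1 - mu2): the mean difference between the
    # two samples reveals the separating direction.
    direction = xa.mean(0) - xb.mean(0)
    direction /= np.linalg.norm(direction)

    proj = xa @ direction                   # 1-D projection of sample A
    labels = proj > proj.mean()             # crude threshold in the projected space
    acc = max((labels == za).mean(), (labels != za).mean())
    print(f"component recovery accuracy on sample A: {acc:.2f}")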
(Not) Bounding the True Error
Langford, John, Caruana, Rich
We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound.
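A schematic Python sketch of this recipe follows. It is a simplification under my own assumptions about the model, prior, noise-search tolerance, and bound form, not the authors' exact procedure: train a linear classifier, grow a per-parameter noise scale until the stochastic training error degrades, then evaluate a PAC-Bayes-kl bound for the resulting Gaussian posterior.

    # Simplified sketch (my assumptions, not the paper's exact procedure).
    import numpy as np

    rng = np.random.default_rng(1)
    m, d = 2000, 10
    X = rng.normal(size=(m, d))
    y = np.sign(X @ rng.normal(size=d))            # linearly separable labels

    # crude training: a few averaged-perceptron-style updates
    w = np.zeros(d)
    for _ in range(20):
        miss = np.sign(X @ w) != y
        if not miss.any():
            break
        w += (y[miss, None] * X[miss]).sum(0) / m

    def stochastic_error(sigma, n_draws=50):
        """Training error averaged over Gaussian parameter noise with scale sigma."""
        errs = [np.mean(np.sign(X @ (w + rng.normal(scale=sigma, size=d))) != y)
                for _ in range(n_draws)]
        return float(np.mean(errs))

    base = stochastic_error(np.zeros(d))
    # sensitivity search: per parameter, the largest noise scale that keeps the
    # stochastic training error within a small tolerance of the noiseless error
    sigma = np.full(d, 1e-3)
    for j in range(d):
        while sigma[j] < 10.0:
            trial = sigma.copy()
            trial[j] *= 2
            if stochastic_error(trial) > base + 0.01:
                break
            sigma = trial

    # PAC-Bayes with posterior Q = N(w, diag(sigma^2)) and prior P = N(0, tau^2 I)
    tau = 1.0
    kl_qp = 0.5 * np.sum(sigma**2 / tau**2 + w**2 / tau**2 - 1 + 2 * np.log(tau / sigma))

    def kl_inverse(q_hat, eps, tol=1e-6):
        """Largest p in [q_hat, 1] with kl(q_hat || p) <= eps (binary search)."""
        def kl(q, p):
            q, p = np.clip(q, 1e-12, 1 - 1e-12), np.clip(p, 1e-12, 1 - 1e-12)
            return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))
        lo, hi = q_hat, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if kl(q_hat, mid) <= eps else (lo, mid)
        return lo

    delta = 0.05
    q_hat = stochastic_error(sigma, n_draws=200)   # Monte Carlo estimate, illustrative only
    bound = kl_inverse(q_hat, (kl_qp + np.log(2 * np.sqrt(m) / delta)) / m)
    print(f"stochastic training error ~{q_hat:.3f}, PAC-Bayes true-error bound ~{bound:.3f}")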
Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
Caruana, Rich, Lawrence, Steve, Giles, C. Lee
The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity.
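The claim is easy to probe empirically. The sketch below (an illustrative setup of my own, not the paper's experiments) fits a modest and a heavily over-parameterized net to the same noisy 1-D regression task, both trained with backprop (scikit-learn's Adam-based MLPRegressor) and early stopping on a held-out validation split; the target function, noise level, and hidden sizes are assumptions.

    # Illustrative check (not the paper's experiments): modest vs. excess capacity,
    # both trained with backprop and early stopping on a held-out validation split.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(2)
    x = rng.uniform(-3, 3, size=(400, 1))
    y = np.sin(2 * x[:, 0]) + 0.3 * rng.normal(size=400)    # noisy 1-D regression task
    x_test = np.linspace(-3, 3, 500)[:, None]
    y_test = np.sin(2 * x_test[:, 0])

    for hidden in (8, 512):                                  # small net vs. excess capacity
        net = MLPRegressor(hidden_layer_sizes=(hidden,),
                           solver="adam", learning_rate_init=1e-3,
                           early_stopping=True,              # hold out 10% for validation
                           validation_fraction=0.1,
                           n_iter_no_change=20, max_iter=5000,
                           random_state=0)
        net.fit(x, y)
        mse = np.mean((net.predict(x_test) - y_test) ** 2)
        print(f"hidden units: {hidden:4d}   test MSE: {mse:.3f}")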
Promoting Poor Features to Supervisors: Some Inputs Work Better as Outputs
Caruana, Rich, Sa, Virginia R. de
In supervised learning there is usually a clear distinction between inputs and outputs - inputs are what you will measure, outputs are what you will predict from those measurements. This paper shows that the distinction between inputs and outputs is not this simple. Some features are more useful as extra outputs than as inputs. By using a feature as an output we get more than just the case values but can learn a mapping from the other inputs to that feature. For many features this mapping may be more useful than the feature value itself. We present two regression problems and one classification problem where performance improves if features that could have been used as inputs are used as extra outputs instead.
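A toy demonstration of the idea (my own construction, not the paper's tasks): a noisy measurement of the latent quantity behind the target tends to help more as an extra regression output than as an extra input. The data-generating process, noise levels, and network size below are assumptions chosen to make the effect plausible.

    # Toy comparison (my construction, not the paper's tasks): a noisy feature f
    # used either as an extra input or as an extra output (multitask target).
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(3)
    n, d = 1200, 5
    x = rng.normal(size=(n, d))
    latent = np.tanh(x @ rng.normal(size=d))        # hidden quantity behind both signals
    y = latent + 0.1 * rng.normal(size=n)           # main target
    f = latent + 0.5 * rng.normal(size=n)           # noisy extra feature

    tr, te = slice(0, 800), slice(800, n)

    def make_net():
        return MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000, random_state=0)

    # (a) the extra feature as an additional input
    as_input = make_net().fit(np.hstack([x[tr], f[tr, None]]), y[tr])
    mse_in = np.mean((as_input.predict(np.hstack([x[te], f[te, None]])) - y[te]) ** 2)

    # (b) the extra feature as an additional output; only the y column is evaluated
    as_output = make_net().fit(x[tr], np.column_stack([y[tr], f[tr]]))
    mse_out = np.mean((as_output.predict(x[te])[:, 0] - y[te]) ** 2)

    print(f"feature as extra input:  test MSE {mse_in:.4f}")
    print(f"feature as extra output: test MSE {mse_out:.4f}")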
Using the Future to "Sort Out" the Present: Rankprop and Multitask Learning for Medical Risk Evaluation
Caruana, Rich, Baluja, Shumeet, Mitchell, Tom
This paper presents two methods that can improve generalization on a broad class of problems. This class includes identifying low-risk pneumonia patients. The first method, rankprop, tries to learn simple models that support ranking future cases while simultaneously learning to rank the training set. The second, multitask learning (MTL), uses lab tests available only during training as additional target values to bias learning towards a more predictive hidden layer. Experiments using a database of pneumonia patients indicate that together these methods outperform standard backpropagation by 10-50%. Rankprop and MTL are applicable to a large class of problems in which the goal is to learn a relative ranking over the instance space, and where the training data includes features that will not be available at run time. Such problems include identifying higher-risk medical patients as early as possible, identifying lower-risk financial investments, and visual analysis of scenes that become easier to analyze as they are approached in the future.

Acknowledgements: We thank Greg Cooper, Michael Fine, and other members of the Pitt/CMU Cost-Effective Health Care group for help with the Medis Database. This work was supported by ARPA grant F33615-93-1-1330, NSF grant BES-9315428, Agency for Health Care Policy and Research grant HS06468, and an NSF Graduate Student Fellowship (Baluja).
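For the ranking half of the paper, here is a minimal pairwise-ranking sketch in the same spirit (not the rankprop algorithm itself): a linear scorer is trained with a logistic loss on random pairs so that higher-risk cases receive higher scores, then evaluated by pairwise concordance with the true risk. The synthetic risk model is an assumption.

    # Minimal pairwise-ranking sketch (not rankprop itself): a linear scorer trained
    # with a logistic loss on random pairs so higher-risk cases get higher scores.
    # The synthetic risk model below is an assumption.
    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 1000, 8
    x = rng.normal(size=(n, d))
    risk = x @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # latent relative risk

    w = np.zeros(d)
    lr = 0.05
    for _ in range(2000):
        i, j = rng.integers(n, size=2)
        if risk[i] == risk[j]:
            continue
        hi, lo = (i, j) if risk[i] > risk[j] else (j, i)
        margin = (x[hi] - x[lo]) @ w
        grad = -(x[hi] - x[lo]) / (1.0 + np.exp(margin))       # gradient of log(1 + e^{-margin})
        w -= lr * grad

    # fraction of random pairs ordered consistently with the true risk (concordance)
    scores = x @ w
    ii, jj = rng.integers(n, size=(2, 2000))
    valid = risk[ii] != risk[jj]
    agree = (scores[ii] > scores[jj]) == (risk[ii] > risk[jj])
    print(f"pairwise concordance: {agree[valid].mean():.2f}")

The multitask-learning half, where lab results available only at training time act as extra target values, follows the same pattern as the extra-outputs sketch shown earlier, except that the auxiliary targets are never presented as inputs at prediction time.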