An exciting branch of machine learning research focuses on methods for learning, optimizing, and integrating unknown functions that are difficult or costly to evaluate. A popular Bayesian approach to this problem uses a Gaussian process (GP) to construct a posterior distribution over the function of interest given a set of observed measurements, and selects new points to evaluate using the statistics of this posterior. Here we extend these methods to exploit derivative information from the unknown function. We describe methods for Bayesian optimization (BO) and Bayesian quadrature (BQ) in settings where first and second derivatives may be evaluated along with the function itself. We perform sampling-based inference in order to incorporate uncertainty over hyperparameters, and show that both hyperparameter and function uncertainty decrease much more rapidly when using derivative information. Moreover, we introduce techniques for overcoming ill-conditioning issues that have plagued earlier methods for gradient-enhanced Gaussian processes and kriging. We illustrate the efficacy of these methods using applications to real and simulated Bayesian optimization and quadrature problems, and show that exploting derivatives can provide substantial gains over standard methods.
I have made some progress with my work on combining independent evidence using a Bayesian approach but eschewing standard Bayesian updating. I found a neat analytical way of doing this, to a very good approximation, in cases where each estimate of a parameter corresponds to the ratio of two variables each determined with normal error, the fractional uncertainty in the numerator and denominator variables differing between the types of evidence. This seems a not uncommon situation in science, and it is a good approximation to that which exists when estimating climate sensitivity. I have had a manuscript in which I develop and test this method accepted by the Journal of Statistical Planning and Inference (for a special issue on Confidence Distributions edited by Tore Schweder and Nils Hjort). Frequentist coverage is almost exact using my analytical solution, based on combining Jeffreys' priors in quadrature, whereas Bayesian updating produces far poorer probability matching.
Paul Graham popularized the term "Bayesian Classification" (or more accurately "Naïve Bayesian Classification") after his "A Plan for Spam" article was published (http://www.paulgraham.com/spam.html). In fact, text classifiers based on naïve Bayesian and other techniques have been around for many years. Companies such as Autonomy and Interwoven incorporate machine-learning techniques to automatically classify documents of all kinds; one such machine-learning technique is naïve Bayesian text classification. Naïve Bayesian text classifiers are fast, accurate, simple, and easy to implement. In this article, I present a complete naïve Bayesian text classifier written in 100 lines of commented, nonobfuscated Perl.
Editor's note: The following is an interview with Columbia University Professor Andrew Gelman conducted by Marketing scientist Kevin Gray, in which Gelman spells out the ABCs of Bayesian statistics. Kevin Gray: Most marketing researchers have heard of Bayesian statistics but know little about it. Can you briefly explain in layperson's terms what it is and how it differs from the'ordinary' statistics most of us learned in college? Andrew Gelman: Bayesian statistics uses the mathematical rules of probability to combines data with "prior information" to give inferences which (if the model being used is correct) are more precise than would be obtained by either source of information alone. Classical statistical methods avoid prior distributions.