AITopics | Williamson, Robert C.

Le Cam meets LeCun: Deficiency and Generic Feature Learning

van Rooyen, Brendan, Williamson, Robert C.

arXiv.org Machine LearningFeb-21-2014

"Deep Learning" methods attempt to learn generic features in an unsupervised fashion from a large unlabelled data set. These generic features should perform as well as the best hand crafted features for any learning problem that makes use of this data. We provide a definition of generic features, characterize when it is possible to learn them and provide methods closely related to the autoencoder and deep belief network of deep learning. In order to do so we use the notion of deficiency and illustrate its value in studying certain general learning problems.

deep learning, experiment, neural network, (20 more...)

arXiv.org Machine Learning

1402.4884

Country: Oceania > Australia (0.14)

Genre: Research Report (0.40)

Industry: Education > Focused Education > Special Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.88)

Add feedback

Mixability in Statistical Learning

Erven, Tim V., Grünwald, Peter, Reid, Mark D., Williamson, Robert C.

Neural Information Processing SystemsDec-31-2012

Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability.

artificial intelligence, bayesian inference, mixability, (17 more...)

Neural Information Processing Systems

Country:

Oceania > Australia (0.14)
North America > United States (0.14)
Europe > Netherlands (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)

Add feedback

Composite Multiclass Losses

Vernet, Elodie, Reid, Mark D., Williamson, Robert C.

Neural Information Processing SystemsDec-31-2011

We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a ``proper composite loss'', which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We also show that the integral representation for binary proper losses can not be extended to multiclass losses. We subsume existing results on ``classification calibration'' by relating it to properness. We draw conclusions concerning the design of multiclass losses.

artificial intelligence, machine learning, proper loss, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Composite Binary Losses

Reid, Mark D., Williamson, Robert C.

arXiv.org Machine LearningDec-16-2009

We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and ``classification calibrated'' losses. We also consider the question of the ``best'' surrogate binary loss. We introduce a precise notion of ``best'' and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of ``surrogate tuning''. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise.

artificial intelligence, composite loss, machine learning, (18 more...)

arXiv.org Machine Learning

0912.3301

Country: North America > United States > Wisconsin (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Information, Divergence and Risk for Binary Experiments

Reid, Mark D., Williamson, Robert C.

arXiv.org Machine LearningJan-5-2009

We unify f-divergences, Bregman divergences, surrogate loss bounds (regret bounds), proper scoring rules, matching losses, cost curves, ROC-curves and information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their primitives which all are related to cost-sensitive binary classification. As well as clarifying relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate loss bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants. It also suggests new techniques for estimating f-divergences.

artificial intelligence, divergence, survey article, (15 more...)

arXiv.org Machine Learning

0901.0356

Country:

Europe > United Kingdom > England (0.27)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry: Leisure & Entertainment > Games (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Add feedback

Hyperkernels

Ong, Cheng S., Williamson, Robert C., Smola, Alex J.

Neural Information Processing SystemsDec-31-2003

We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel HilbertSpace on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

artificial intelligence, kernel, machine learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Oceania > Australia (0.14)

Genre: Research Report (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.56)

Add feedback

Hyperkernels

Ong, Cheng S., Williamson, Robert C., Smola, Alex J.

Neural Information Processing SystemsDec-31-2003

We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

artificial intelligence, kernel, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America > United States (0.29)

Genre: Research Report (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.56)

Add feedback

Kernel Machines and Boolean Functions

Kowalczyk, Adam, Smola, Alex J., Williamson, Robert C.

Neural Information Processing SystemsDec-31-2002

We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

Add feedback

Kernel Machines and Boolean Functions

Kowalczyk, Adam, Smola, Alex J., Williamson, Robert C.

Neural Information Processing SystemsDec-31-2002

We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

Add feedback

Online Learning with Kernels

Kivinen, Jyrki, Smola, Alex J., Williamson, Robert C.

Neural Information Processing SystemsDec-31-2002

We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization.

algorithm, computer based training, educational technology, (20 more...)

Neural Information Processing Systems

Country: North America > United States (0.15)

Industry: Education > Educational Setting > Online (0.61)

Technology: