AITopics | Williamson, Robert C.

Collaborating Authors

Williamson, Robert C.

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Exp-Concavity of Proper Composite Losses

Kamalaruban, Parameswaran, Williamson, Robert C., Zhang, Xinhua

arXiv.org Machine LearningMay-20-2018

The goal of online prediction with expert advice is to find a decision strategy which will perform almost as well as the best expert in a given pool of experts, on any sequence of outcomes. This problem has been widely studied and $O(\sqrt{T})$ and $O(\log{T})$ regret bounds can be achieved for convex losses (\cite{zinkevich2003online}) and strictly convex losses with bounded first and second derivatives (\cite{hazan2007logarithmic}) respectively. In special cases like the Aggregating Algorithm (\cite{vovk1995game}) with mixable losses and the Weighted Average Algorithm (\cite{kivinen1999averaging}) with exp-concave losses, it is possible to achieve $O(1)$ regret bounds. \cite{van2012exp} has argued that mixability and exp-concavity are roughly equivalent under certain conditions. Thus by understanding the underlying relationship between these two notions we can gain the best of both algorithms (strong theoretical performance guarantees of the Aggregating Algorithm and the computational efficiency of the Weighted Average Algorithm). In this paper we provide a complete characterization of the exp-concavity of any proper composite loss. Using this characterization and the mixability condition of proper losses (\cite{van2012mixability}), we show that it is possible to transform (re-parameterize) any $\beta$-mixable binary proper loss into a $\beta$-exp-concave composite loss with the same $\beta$. In the multi-class case, we propose an approximation approach for this transformation.

artificial intelligence, composite loss, machine learning, (18 more...)

arXiv.org Machine Learning

1805.07737

Country: Oceania > Australia (0.28)

Genre: Research Report (0.81)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

f-GANs in an Information Geometric Nutshell

Nock, Richard, Cranko, Zac, Menon, Aditya K., Qu, Lizhen, Williamson, Robert C.

Neural Information Processing SystemsDec-31-2017

Nowozin et al showed last year how to extend the GAN principle to all f-divergences. The approach is elegant but falls short of a full description of the supervised game, and says little about the key player, the generator: for example, what does the generator actually converge to if solving the GAN game means convergence in some space of parameters? How does that provide hints on the generator's designand compare to the flourishing but almost exclusively experimental literature on the subject? In this paper, we unveil a broad class of distributions for which such convergence happens -- namely, deformed exponential families, a wide superset of exponential families --. We show that current deep architectures are able to factorize a very large number of such densities using an especially compact design, hence displaying the power of deep architectures and their concinnity in the f-GAN game. This result holds given a sufficient condition on activation functions -- which turns out to be satisfied by popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures, via (i) a principled design of the activation functions in the generator and (ii) an explicit integration of proper composite losses' link function in the discriminator.

deep learning, exponential family, neural network, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > United Kingdom > England (0.14)
Asia > Middle East (0.14)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

f-GANs in an Information Geometric Nutshell

Nock, Richard, Cranko, Zac, Menon, Aditya Krishna, Qu, Lizhen, Williamson, Robert C.

arXiv.org Machine LearningJul-14-2017

Nowozin \textit{et al} showed last year how to extend the GAN \textit{principle} to all $f$-divergences. The approach is elegant but falls short of a full description of the supervised game, and says little about the key player, the generator: for example, what does the generator actually converge to if solving the GAN game means convergence in some space of parameters? How does that provide hints on the generator's design and compare to the flourishing but almost exclusively experimental literature on the subject? In this paper, we unveil a broad class of distributions for which such convergence happens --- namely, deformed exponential families, a wide superset of exponential families --- and show tight connections with the three other key GAN parameters: loss, game and architecture. In particular, we show that current deep architectures are able to factorize a very large number of such densities using an especially compact design, hence displaying the power of deep architectures and their concinnity in the $f$-GAN game. This result holds given a sufficient condition on \textit{activation functions} --- which turns out to be satisfied by popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures, via (i) a principled design of the activation functions in the generator and (ii) an explicit integration of proper composite losses' link function in the discriminator.

deep learning, exponential family, neural network, (20 more...)

arXiv.org Machine Learning

1707.04385

Country:

Europe > United Kingdom > England (0.14)
Asia > Middle East (0.14)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

Add feedback

Learning with Symmetric Label Noise: The Importance of Being Unhinged

Rooyen, Brendan van, Menon, Aditya, Williamson, Robert C.

Neural Information Processing SystemsDec-31-2015

Convex potential minimisation is the de facto approach to binary classification. However, Long and Servedio [2008] proved that under symmetric label noise (SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex, classification-calibrated loss and prove that it is SLN-robust. The loss avoids the Long and Servedio [2008] result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong l2 regularisation makes most standard learners SLN-robust. Experiments confirm the unhinged loss’ SLN-robustness.

artificial intelligence, machine learning, sln-robustness, (15 more...)

Neural Information Processing Systems

Country: North America > United States > New York (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

An Average Classification Algorithm

van Rooyen, Brendan, Menon, Aditya Krishna, Williamson, Robert C.

arXiv.org Machine LearningDec-15-2015

Many classification algorithms produce a classifier that is a weighted average of kernel evaluations. When working with a high or infinite dimensional kernel, it is imperative for speed of evaluation and storage issues that as few training samples as possible are used in the kernel expansion. Popular existing approaches focus on altering standard learning algorithms, such as the Support Vector Machine, to induce sparsity, as well as post-hoc procedures for sparse approximations. Here we adopt the latter approach. We begin with a very simple classifier, given by the kernel mean $$ f(x) = \frac{1}{n} \sum\limits_{i=i}^{n} y_i K(x_i,x) $$ We then find a sparse approximation to this kernel mean via herding. The result is an accurate, easily parallelized algorithm for learning classifiers.

artificial intelligence, machine learning, preliminary version, (15 more...)

arXiv.org Machine Learning

1506.0152

Country: Oceania > Australia (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.54)

Add feedback

Fast rates in statistical and online learning

van Erven, Tim, Grünwald, Peter D., Mehta, Nishant A., Reid, Mark D., Williamson, Robert C.

arXiv.org Machine LearningSep-1-2015

The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning --- a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition, that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable to deal with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction with expert advice setting.

artificial intelligence, educational setting, machine learning, (20 more...)

arXiv.org Machine Learning

1507.02592

Country:

Europe > Netherlands (0.28)
Oceania > Australia (0.27)
Europe > United Kingdom > England (0.14)

Genre: Research Report (0.63)

Industry: Education > Educational Setting > Online (0.51)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

A Theory of Feature Learning

van Rooyen, Brendan, Williamson, Robert C.

arXiv.org Machine LearningMar-31-2015

Feature Learning aims to extract relevant information contained in data sets in an automated fashion. It is driving force behind the current deep learning trend, a set of methods that have had widespread empirical success. What is lacking is a theoretical understanding of different feature learning schemes. This work provides a theoretical framework for feature learning and then characterizes when features can be learnt in an unsupervised fashion. We also provide means to judge the quality of features via rate-distortion theory and its generalizations.

deep learning, information, neural network, (18 more...)

arXiv.org Machine Learning

1504.00083

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

From Stochastic Mixability to Fast Rates

Mehta, Nishant A., Williamson, Robert C.

Neural Information Processing SystemsDec-31-2014

Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F},\mathsf{P})$ ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell,\mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.

artificial intelligence, machine learning, stochastic mixability, (16 more...)

Neural Information Processing Systems

Country: Oceania > Australia (0.14)

Industry: Education (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.36)

Add feedback

From Stochastic Mixability to Fast Rates

Mehta, Nishant A., Williamson, Robert C.

arXiv.org Machine LearningNov-22-2014

artificial intelligence, machine learning, stochastic mixability, (18 more...)

arXiv.org Machine Learning

1406.3781

Country: Oceania > Australia (0.14)

Genre:

Workflow (0.46)
Research Report (0.40)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.35)

Add feedback

Generalised Mixability, Constant Regret, and Bayesian Updating

Reid, Mark D., Frongillo, Rafael M., Williamson, Robert C.

arXiv.org Machine LearningMar-10-2014

Mixability of a loss is known to characterise when constant regret bounds are achievable in games of prediction with expert advice through the use of Vovk's aggregating algorithm. We provide a new interpretation of mixability via convex analysis that highlights the role of the Kullback-Leibler divergence in its definition. This naturally generalises to what we call $\Phi$-mixability where the Bregman divergence $D_\Phi$ replaces the KL divergence. We prove that losses that are $\Phi$-mixable also enjoy constant regret bounds via a generalised aggregating algorithm that is similar to mirror descent.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1403.2433

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback