Storkey, Amos

Deep Kernel Transfer in Gaussian Processes for Few-shot Learning Machine Learning

Here, we use the nomenclature derived from the meta-learning literature which is the most prevalent at time of writing. Let S {( x l,y l)} L l 1 be a support-set containing input-output pairs, with L equal to one (1-shot) or five (5-shot), and Q { (x m,y m)} M m 1be a query-set (sometimes referred to in the literature as a target-set), with M typically one order of magnitude greater than L. For ease of notation, the support and query sets are grouped in a task T {S, Q}, with the dataset D {T n} N n 1 defined as a collection of such tasks. Models are trained on random tasks sampled from D . Then, given a new task T {S, Q } sampled from a test set, the objective is to condition the model on the samples of the support S to estimate the membership of the samples in the query set Q . In the most common scenario, the inputs x D belong to the same distribution p(x) and are distributed across training, validation, and test sets such that their class membership is non-overlapping. Note that y can be a continuous value (regression) or a discrete one (classification), even though most of the previous work has focused on classification. We also consider the cross-domain scenario, where the inputs are sampled from different distributions at training and test time; this is more representative of real-world scenarios.

Learning to learn via Self-Critique Machine Learning

In few-shot learning, a machine learning system learns from a small set of labelled examples relating to a specific task, such that it can generalize to new examples of the same task. Given the limited availability of labelled examples in such tasks, we wish to make use of all the information we can. Usually a model learns task-specific information from a small training-set (support-set) to predict on an unlabelled validation set (target-set). The target-set contains additional task-specific information which is not utilized by existing few-shot learning methods. Making use of the target-set examples via transductive learning requires approaches beyond the current methods; at inference time, the target-set contains only unlabelled input data-points, and so discriminative learning cannot be used. In this paper, we propose a framework called Self-Critique and Adapt or SCA, which learns to learn a label-free loss function, parameterized as a neural network. A base-model learns on a support-set using existing methods (e.g. stochastic gradient descent combined with the cross-entropy loss), and then is updated for the incoming target-task using the learnt loss function. This label-free loss function is itself optimized such that the learnt model achieves higher generalization performance. Experiments demonstrate that SCA offers substantially reduced error-rates compared to baselines which only adapt on the support-set, and results in state of the art benchmark performance on Mini-ImageNet and Caltech-UCSD Birds 200.

BlockSwap: Fisher-guided Block Substitution for Network Compression Machine Learning

The desire to run neural networks on low-capacity edge devices has led to the development of a wealth of compression techniques. Moonshine (Crowley et al., 2018a) is a simple and powerful example of this: one takes a large pre-trained network and substitutes each of its convolutional blocks with a selected cheap alternative block, then distills the resultant network with the original. However, not all blocks are created equally; for a required parameter budget there may exist a potent combination of many different cheap blocks. In this work, we find these by developing BlockSwap: an algorithm for choosing networks with interleaved block types by passing a single minibatch of training data through randomly initialised networks and gauging their Fisher potential. We show that block-wise cheapening yields more accurate networks than single block-type networks across a spectrum of parameter budgets. Code is available at

Separable Layers Enable Structured Efficient Linear Substitutions Machine Learning

In response to the development of recent efficient dense layers, this paper shows that something as simple as replacing linear components in pointwise convolutions with structured linear decompositions also produces substantial gains in the efficiency/accuracy tradeoff. Pointwise convolutions are fully connected layers and are thus prepared for replacement by structured transforms. Networks using such layers are able to learn the same tasks as those using standard convolutions, and provide Pareto-optimal benefits in efficiency/accuracy, both in terms of computation (mult-adds) and parameter count (and hence memory).

Zero-shot Knowledge Transfer via Adversarial Belief Matching Machine Learning

Performing knowledge transfer from a large teacher network to a smaller student is a popular task in modern deep learning applications. However, due to growing dataset sizes and stricter privacy regulations, it is increasingly common not to have access to the data that was used to train the teacher. We propose a novel method which trains a student to match the predictions of its teacher without using any data or metadata. We achieve this by training an adversarial generator to search for images on which the student poorly matches the teacher, and then using them to train the student. Our resulting student closely approximates its teacher for simple datasets like SVHN, and on CIFAR10 we improve on the state-of-the-art for few-shot distillation (with 100 images per class), despite using no data. Finally, we also propose a metric to quantify the degree of belief matching between teacher and student in the vicinity of decision boundaries, and observe a significantly higher match between our zero-shot student and the teacher, than between a student distilled with real data and the teacher. Code available at:

Assume, Augment and Learn: Unsupervised Few-Shot Meta-Learning via Random Labels and Data Augmentation Machine Learning

The field of few-shot learning has been laboriously explored in the supervised setting, where per-class labels are available. On the other hand, the unsupervised few-shot learning setting, where no labels of any kind are required, has seen little investigation. We propose a method, named Assume, Augment and Learn or AAL, for generating few-shot tasks using unlabeled data. We randomly label a random subset of images from an unlabeled dataset to generate a support set. Then by applying data augmentation on the support set's images, and reusing the support set's labels, we obtain a target set. The resulting few-shot tasks can be used to train any standard meta-learning framework. Once trained, such a model, can be directly applied on small real-labeled datasets without any changes or fine-tuning required. In our experiments, the learned models achieve good generalization performance in a variety of established few-shot learning tasks on Omniglot and Mini-Imagenet.

On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length Machine Learning

The training of deep neural networks with Stochastic Gradient Descent (SGD) with a large learning rate or a small batch-size typically ends in flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. This was found to correlate with a good final generalization performance. In this paper we extend previous work by investigating the curvature of the loss surface along the whole training trajectory, rather than only at the endpoint. We find that initially SGD visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch-size of SGD. At this peak value SGD starts to fail to minimize the loss along directions in the loss surface corresponding to the largest curvature (sharpest directions). To further investigate the effect of these dynamics in the training process, we study a variant of SGD using a reduced learning rate along the sharpest directions which we show can improve training speed while finding both a sharper and better generalizing solution, compared to vanilla SGD. Overall, our results show that the SGD dynamics in the subspace of the sharpest directions influence the regions that SGD steers to (where larger learning rate or smaller batch size result in wider regions visited), the overall training speed, and the generalization ability of the final model.

Dilated DenseNets for Relational Reasoning Machine Learning

Despite their impressive performance in many tasks, deep neural networks often struggle at relational reasoning. This has recently been remedied with the introduction of a plug-in relational module that considers relations between pairs of objects. Unfortunately, this is combinatorially expensive. In this extended abstract, we show that a DenseNet incorporating dilated convolutions excels at relational reasoning on the Sort-of-CLEVR dataset, allowing us to forgo this relational module and its associated expense.

Exploration by Random Network Distillation Artificial Intelligence

We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.

HAKD: Hardware Aware Knowledge Distillation Machine Learning

Despite recent developments, deploying deep neural networks on resource constrained general purpose hardware remains a significant challenge. There has been much work in developing methods for reshaping neural networks, usually with a focus on minimising total parameter count. These methods are typically developed in a hardware-agnostic manner and do not exploit hardware behaviour. In this paper we propose a new approach, Hardware Aware Knowledge Distillation (HAKD) which uses empirical observations of hardware behaviour to design efficient student networks which are then trained with knowledge distillation. This allows the trade-off between accuracy and performance to be managed explicitly. We have applied this approach across three platforms and evaluated it on two networks, MobileNet and DenseNet, on CIFAR-10. We show that HAKD outperforms Deep Compression and Fisher pruning in terms of size, accuracy and performance.