Neural networks play a very important role in modeling unstructured data, such as in language or image processing. The idea of such networks is to simulate the structure of the brain using nodes and edges, with numerical weights processed by activation functions. The output of such a network is usually a prediction, such as a classification. This is achieved by optimizing toward a given target using a loss function. In a previous post, we discussed the importance of customizing this loss function for the case of gradient boosting trees. In this post, we discuss how to customize the optimizers to speed up and improve the process of finding a (local) minimum of the loss function.
Bayesian optimization is a sample-efficient method for finding a global optimum of an expensive-to-evaluate black-box function. A global solution is found by accumulating pairs of query points and corresponding function values while repeating two procedures: (i) learning a surrogate model for the objective function using the data observed so far; (ii) maximizing an acquisition function to determine where to query the objective function next. Convergence guarantees are only valid when the global optimizer of the acquisition function is found and selected as the next query point. In practice, however, local optimizers of acquisition functions are also used, since searching for the exact optimizer of the acquisition function is often a non-trivial or time-consuming task. In this paper we present an analysis of the behavior of local optimizers of acquisition functions, in terms of instantaneous regrets over global optimizers. We also present a performance analysis for the case where multi-started local optimizers are used to find the maximum of the acquisition function. Numerical experiments confirm the validity of our theoretical analysis.
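To make the difference between a single local optimizer and a multi-started one concrete, here is a minimal sketch in plain Python. It is not the analysis from the paper: the "acquisition function" is a hypothetical toy curve with several local maxima on [0, 1], the local optimizer is simple projected gradient ascent, and the start points, step size, and function names are all assumptions chosen for illustration.

```python
import math

def acquisition(x):
    """Toy multimodal stand-in for an acquisition function on [0, 1] (assumed)."""
    return x * math.sin(4 * math.pi * x)

def acquisition_grad(x):
    """Analytic derivative of the toy acquisition function."""
    u = 4 * math.pi * x
    return math.sin(u) + u * math.cos(u)

def local_ascent(x0, lr=0.002, steps=1000):
    """Projected gradient ascent from x0, clipped to stay inside [0, 1]."""
    x = x0
    for _ in range(steps):
        x = min(1.0, max(0.0, x + lr * acquisition_grad(x)))
    return x

def multi_start(starts):
    """Run the local optimizer from every start and keep the best end point."""
    return max((local_ascent(s) for s in starts), key=acquisition)

# A single start can get stuck in a poor local maximum.
single = local_ascent(0.1)

# Several starts spread over the domain are more likely to reach the best one.
multi = multi_start([0.1, 0.3, 0.5, 0.7, 0.9])

# Gap in acquisition value between the two candidate query points.
gap = acquisition(multi) - acquisition(single)
print(f"single start: x={single:.3f}, multi start: x={multi:.3f}, gap={gap:.3f}")
```

In this toy setting the single start settles on the nearest peak, while the multi-start run reaches a higher one. The gap is measured on the acquisition function only; how such a gap translates into regret on the true objective is what the analysis in the text addresses.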
Deep neural networks frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy. Compressing models before they are deployed can therefore result in significantly more efficient systems. However, while the results are desirable, finding the best compression strategy for a given neural network, target platform, and optimization objective often requires extensive experimentation. Moreover, finding optimal hyperparameters for a given compression strategy typically results in even more expensive, frequently manual, trial-and-error exploration. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build complex compression strategies. Given a strategy and a user-provided objective, such as minimization of running time, Condensa uses a novel sample-efficient constrained Bayesian optimization algorithm to automatically infer desirable sparsity ratios. Our experiments on three real-world image classification and language modeling tasks demonstrate memory footprint reductions of up to 65x and runtime throughput improvements of up to 2.22x using at most 10 samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa.
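The idea of composing simple compression operators into a strategy can be illustrated with a small, dependency-free Python sketch. This is not Condensa's API: the functions `prune_by_magnitude`, `quantize_uniform`, and `compose` are hypothetical names introduced here, the weights are a flat list of floats rather than a real model, and the sparsity ratio is fixed by hand instead of being inferred by Bayesian optimization.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def quantize_uniform(weights, bits=8):
    """Symmetric uniform quantization: snap each weight to a signed integer grid."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    levels = 2 ** (bits - 1) - 1                   # e.g. 127 for 8 bits
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]

def compose(*operators):
    """Chain operators so that each one transforms the output of the previous."""
    def apply(weights):
        for op in operators:
            weights = op(weights)
        return weights
    return apply

# Hypothetical strategy: prune half of the weights, then quantize to 8 bits.
strategy = compose(
    lambda w: prune_by_magnitude(w, sparsity=0.5),
    lambda w: quantize_uniform(w, bits=8),
)

weights = [0.05, -0.8, 0.3, -0.02, 0.6, 0.1]
compressed = strategy(weights)
print(compressed)
```

In a system like the one described, the sparsity ratio passed to the pruning step would be the quantity searched for automatically, subject to the user-provided objective.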
In certain applications the objective function is expensive or difficult to evaluate. In these situations, a common approach is to create a simpler surrogate model of the objective function that is cheaper to evaluate and is used in its place to solve the optimization problem. Moreover, due to the high cost of evaluating the objective function, an iterative approach is often recommended. Iterative optimizers work by requesting evaluations of the function at a sequence of points in the domain. Bayesian optimization adds a Bayesian methodology to the iterative optimizer paradigm by incorporating a prior model on the space of possible target functions.
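The iterative, surrogate-based loop described above can be sketched in dependency-free Python. The sketch assumes a one-dimensional domain [0, 1], a zero-mean Gaussian process prior with a squared-exponential kernel as the surrogate, an upper-confidence-bound acquisition rule, and a plain grid search over the acquisition function. The objective, the kernel length scale, and all function names are illustrative assumptions, not something prescribed by the text.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP surrogate at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, var

def objective(x):
    """Stand-in for an expensive black-box function, to be maximized (assumed)."""
    return -(x - 0.3) ** 2

def bayes_opt(n_iter=8, beta=2.0):
    """Alternate between fitting the surrogate and querying the objective."""
    xs = [0.0, 1.0]                          # initial design
    ys = [objective(x) for x in xs]
    grid = [i / 100 for i in range(101)]     # candidate query points
    for _ in range(n_iter):
        def ucb(x):
            mean, var = gp_posterior(xs, ys, x)
            return mean + beta * math.sqrt(var)
        x_next = max(grid, key=ucb)          # maximize the acquisition function
        xs.append(x_next)
        ys.append(objective(x_next))         # the only "expensive" call
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]

best_x, best_y = bayes_opt()
print(f"best query: x={best_x:.2f}, objective={best_y:.4f}")
```

The surrogate is refit from scratch at every iteration, which is affordable here because the number of observations stays small. The grid search is a simplification that only works in low dimensions; in practice a numerical optimizer would usually take its place.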