
 McWilliams, Brian


Deep Scattering: Rendering Atmospheric Clouds with Radiance-Predicting Neural Networks

arXiv.org Machine Learning

We present a technique for efficiently synthesizing images of atmospheric clouds using a combination of Monte Carlo integration and neural networks. The intricacies of Lorenz-Mie scattering and the high albedo of cloud-forming aerosols make rendering of clouds---e.g. the characteristic silver lining and the "whiteness" of the inner body---challenging for methods based solely on Monte Carlo integration or diffusion theory. We approach the problem differently. Instead of simulating all light transport during rendering, we pre-learn the spatial and directional distribution of radiant flux from tens of cloud exemplars. To render a new scene, we sample visible points of the cloud and, for each, extract a hierarchical 3D descriptor of the cloud geometry with respect to the shading location and the light source. The descriptor is input to a deep neural network that predicts the radiance function for each shading configuration. We make the key observation that progressively feeding the hierarchical descriptor into the network enhances its ability to learn faster and predict with high accuracy while using few coefficients. We also employ a block design with residual connections to further improve performance. A GPU implementation of our method interactively synthesizes images of clouds that are nearly indistinguishable from the reference solution within seconds. Our method thus represents a viable solution for applications such as cloud design and, thanks to its temporal stability, also for high-quality production of animated content.
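As a rough illustration of the progressive, residual design described above, the following NumPy sketch shows a forward pass in which each residual block ingests one level of a hierarchical descriptor before producing a scalar radiance estimate. The layer widths, weight layout, and scalar readout are hypothetical and are not the paper's architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def progressive_residual_mlp(descriptor_levels, weights):
    """Toy forward pass: inject one descriptor level per residual block,
    return a scalar radiance estimate.

    descriptor_levels: list of 1-D arrays, coarse-to-fine levels of the descriptor.
    weights: one dict per block with keys 'W_in' (hidden x level_dim),
             'W1' and 'W2' (hidden x hidden).
    """
    hidden = weights[0]["W1"].shape[0]
    h = np.zeros(hidden)
    for level, w in zip(descriptor_levels, weights):
        h = h + w["W_in"] @ level                  # progressively feed the next level
        h = h + w["W2"] @ relu(w["W1"] @ relu(h))  # residual block with skip connection
    return float(h.sum())                          # placeholder scalar readout

# Hypothetical shapes, just to show the call.
rng = np.random.default_rng(0)
hidden, level_dims = 32, [8, 16, 32]
weights = [{"W_in": 0.1 * rng.normal(size=(hidden, d)),
            "W1": 0.1 * rng.normal(size=(hidden, hidden)),
            "W2": 0.1 * rng.normal(size=(hidden, hidden))} for d in level_dims]
levels = [rng.normal(size=d) for d in level_dims]
print(progressive_residual_mlp(levels, weights))
```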


Preserving Differential Privacy Between Features in Distributed Estimation

arXiv.org Machine Learning

Privacy is crucial in many applications of machine learning. Legal, ethical and societal issues restrict the sharing of sensitive data, making it difficult to learn from datasets that are partitioned between many parties. One important instance of such a distributed setting arises when information about each record in the dataset is held by different data owners (the design matrix is "vertically partitioned"). In this setting, few approaches exist for private data sharing for the purposes of statistical estimation, and the classical setup of differential privacy with a "trusted curator" preparing the data does not apply. We work with the notion of $(\epsilon,\delta)$-distributed differential privacy, which extends single-party differential privacy to the distributed, vertically-partitioned case. We propose PriDE, a scalable framework for distributed estimation in which each party communicates perturbed random projections of its locally held features, ensuring that $(\epsilon,\delta)$-distributed differential privacy is preserved. For $\ell_2$-penalized supervised learning problems, PriDE has bounded estimation error compared with the optimal estimates obtained without privacy constraints in the non-distributed setting. We confirm this empirically on real-world and synthetic datasets.
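A minimal sketch of the kind of message a single party might share is given below, assuming a plain Gaussian random projection and the textbook Gaussian-mechanism noise calibration with a row-norm clip as the sensitivity bound. The function name, the clipping step, and the noise calibration are illustrative assumptions, not PriDE's exact construction.

```python
import numpy as np

def perturbed_projection(X_local, k, epsilon, delta, clip=1.0, seed=0):
    """One party projects its locally held feature block X_local (n x p_local)
    down to k dimensions with a Gaussian random projection, then adds Gaussian
    noise before sharing the result with the other parties."""
    rng = np.random.default_rng(seed)
    n, p = X_local.shape
    # Clip row norms so a single record's influence is bounded by `clip`.
    norms = np.linalg.norm(X_local, axis=1, keepdims=True)
    Xc = X_local * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    # Gaussian random projection of the feature block.
    Pi = rng.normal(size=(p, k)) / np.sqrt(k)
    Z = Xc @ Pi
    # Gaussian-mechanism noise (illustrative calibration, not the paper's).
    sigma = clip * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return Z + rng.normal(scale=sigma, size=Z.shape)
```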


The Shattered Gradients Problem: If resnets are the answer, then what is the question?

arXiv.org Machine Learning

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. The problem has largely been overcome through the introduction of carefully constructed initializations and batch normalization. Nevertheless, architectures incorporating skip-connections such as resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying only sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering. Preliminary experiments show that the new initialization allows very deep networks to be trained without the addition of skip-connections.
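The "looks linear" idea can be sketched as follows: with a concatenated-ReLU nonlinearity, initializing each layer as a mirrored block [W, -W] makes the whole network an exact linear map at initialization, so gradients cannot shatter; the nonlinearity only appears once training moves the mirrored halves apart. The weight scaling and layer sizes in this sketch are assumptions for illustration.

```python
import numpy as np

def crelu(z):
    """Concatenated ReLU: [relu(z), relu(-z)]."""
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

def ll_block(d_out, d_in, rng):
    """'Looks linear' weights: a mirrored block [W, -W] acting on CReLU
    features, so that [W, -W] @ crelu(z) == W @ z at initialization."""
    W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # assumed scaling
    return np.concatenate([W, -W], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
W1 = rng.normal(size=(32, 16)) / np.sqrt(16)   # ordinary first layer
W2, W3 = ll_block(32, 32, rng), ll_block(4, 32, rng)

z1 = W1 @ x
z2 = W2 @ crelu(z1)   # equals W2[:, :32] @ z1 exactly at initialization
z3 = W3 @ crelu(z2)
# At init the network computes a purely linear map of x, so the gradient
# signal passes through undistorted.
assert np.allclose(z2, W2[:, :32] @ z1)
```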


Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks

arXiv.org Machine Learning

Modern convolutional networks, incorporating rectifiers and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. Nevertheless, methods from convex optimization such as gradient descent and Adam are widely used as building blocks for deep learning algorithms. This paper provides the first convergence guarantee applicable to modern convnets. The guarantee matches a lower bound for convex nonsmooth functions. The key technical tool is the neural Taylor approximation -- a straightforward application of Taylor expansions to neural networks -- and the associated Taylor loss. Experiments on a range of optimizers, layers, and tasks provide evidence that the analysis accurately captures the dynamics of neural optimization. The second half of the paper applies the Taylor approximation to isolate the main difficulty in training rectifier nets: that gradients are shattered. We investigate the hypothesis that, by exploring the space of activation configurations more thoroughly, adaptive optimizers such as RMSProp and Adam are able to converge to better solutions.
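A minimal sketch of the neural Taylor approximation for a single rectifier layer with squared loss is shown below: the network output is linearized in the parameters around the current iterate, and the convex loss of that linearization is the Taylor loss. The tiny one-layer network and its fixed all-ones readout are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(W, x):
    """Tiny one-hidden-layer rectifier net with a fixed all-ones readout (assumed)."""
    return relu(W @ x).sum()

def taylor_net(W, W_t, x):
    """First-order Taylor approximation of net(., x) around the current
    parameters W_t.  For a rectifier layer the gradient is the active-unit
    mask times the input, so the linearization is exact on the activation
    pattern selected at W_t."""
    mask = (W_t @ x > 0).astype(float)   # active units at W_t
    grad = np.outer(mask, x)             # d net / d W at W_t
    return net(W_t, x) + np.sum(grad * (W - W_t))

def taylor_loss(W, W_t, x, y):
    """Convex loss of the Taylor approximation: convex in W even though the
    true loss of the rectifier net is not."""
    return 0.5 * (taylor_net(W, W_t, x) - y) ** 2

# The gap between the true loss and the Taylor loss along the optimizer's
# trajectory is what a regret-style analysis of this kind controls.
rng = np.random.default_rng(0)
x, y = rng.normal(size=8), 1.0
W_t = rng.normal(size=(4, 8))
print(taylor_loss(W_t + 0.01 * rng.normal(size=W_t.shape), W_t, x, y))
```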


Scalable Adaptive Stochastic Optimization Using Random Projections

Neural Information Processing Systems

Adaptive stochastic gradient methods such as ADAGRAD have gained popularity, in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second-order information by accumulating past gradients, which are used to tune the step size adaptively. In certain situations the full-matrix variant of ADAGRAD is expected to attain better performance; however, in high dimensions it is computationally impractical. We present ADA-LR and RADAGRAD, two computationally efficient approximations to full-matrix ADAGRAD based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve similar performance to full-matrix ADAGRAD, but at a much smaller computational cost. We show that the regret of ADA-LR is close to the regret of full-matrix ADAGRAD, which can have an up to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that ADA-LR and RADAGRAD perform similarly to full-matrix ADAGRAD. On the task of training convolutional neural networks as well as recurrent neural networks, RADAGRAD achieves faster convergence than diagonal ADAGRAD.
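The flavor of the randomized approximation can be sketched as follows: rather than forming and inverting the d x d matrix of accumulated gradient outer products, the d x t gradient history is sketched down to a few dimensions and its top singular directions are used as a preconditioner. The sketch type, the crude fallback term outside the captured subspace, and the hyperparameters below are simplifications for illustration, not the exact ADA-LR/RADAGRAD updates.

```python
import numpy as np

def low_rank_adagrad_step(w, grads, g_new, eta=0.1, k=20, delta=1e-3, seed=0):
    """Illustrative low-rank approximation to a full-matrix AdaGrad step.

    grads: list of past gradients, each of dimension d.  The d x t gradient
    history is sketched to at most k columns; the singular directions of the
    sketch precondition the new gradient within the captured subspace, with a
    simple scalar fallback for the remainder.
    """
    rng = np.random.default_rng(seed)
    grads = grads + [g_new]
    G = np.stack(grads, axis=1)                            # d x t gradient history
    t = G.shape[1]
    m = min(k, t)
    Pi = rng.normal(size=(t, m)) / np.sqrt(m)              # random sketch of the history
    U, s, _ = np.linalg.svd(G @ Pi, full_matrices=False)   # top directions of the sketch
    coeff = U.T @ g_new
    precond = U @ (coeff / (s + delta))                    # precondition inside the subspace
    precond += (g_new - U @ coeff) / np.sqrt(t + delta)    # crude fallback outside it
    return w - eta * precond, grads
```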


Scalable Adaptive Stochastic Optimization Using Random Projections

arXiv.org Machine Learning

Adaptive stochastic gradient methods such as AdaGrad have gained popularity, in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second-order information by accumulating past gradients, which are used to tune the step size adaptively. In certain situations the full-matrix variant of AdaGrad is expected to attain better performance; however, in high dimensions it is computationally impractical. We present Ada-LR and RadaGrad, two computationally efficient approximations to full-matrix AdaGrad based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve similar performance to full-matrix AdaGrad, but at a much smaller computational cost. We show that the regret of Ada-LR is close to the regret of full-matrix AdaGrad, which can have an up to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that Ada-LR and RadaGrad perform similarly to full-matrix AdaGrad. On the task of training convolutional neural networks as well as recurrent neural networks, RadaGrad achieves faster convergence than diagonal AdaGrad.


Variance Reduced Stochastic Gradient Descent with Neighbors

arXiv.org Machine Learning

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its slow convergence can be a computational bottleneck. Variance reduction techniques such as SAG, SVRG and SAGA have been proposed to overcome this weakness, achieving linear convergence. However, these methods are either based on computations of full gradients at pivot points, or on keeping per-data-point corrections in memory. Speed-ups relative to SGD may therefore require a minimum number of epochs to materialize. This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase. As a side product, we provide a unified convergence analysis for a family of variance reduction algorithms, which we call memorization algorithms. We provide experimental results supporting our theory.
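An illustrative sketch of the neighborhood idea for ridge regression: a SAGA-style update in which the freshly computed stochastic gradient is stored not only for the sampled point but also for its precomputed neighbors, trading a controlled approximation for faster early progress. The step size, the ridge objective, and the way the neighborhood list is built are assumptions for illustration, not the paper's exact algorithm or analysis.

```python
import numpy as np

def saga_with_neighbors(w0, X, y, neighbors, lr=0.01, epochs=5, lam=0.1, seed=0):
    """SAGA-style iteration for ridge regression where the stored gradient of a
    sampled point is also reused as the correction for its neighbors.

    neighbors: list of index lists, neighbors[i] = precomputed neighbors of point i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = w0.copy()
    memory = np.zeros((n, d))        # alpha_i: last stored gradient per data point
    avg = memory.mean(axis=0)        # running average of the stored gradients
    for _ in range(epochs * n):
        i = rng.integers(n)
        g_i = (X[i] @ w - y[i]) * X[i] + lam * w      # fresh stochastic gradient
        # SAGA step: variance-reduced gradient estimate using the stored correction.
        w -= lr * (g_i - memory[i] + avg)
        # Share the fresh gradient with i and its neighbors (the approximation).
        for j in [i] + list(neighbors[i]):
            avg += (g_i - memory[j]) / n
            memory[j] = g_i
    return w
```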


DUAL-LOCO: Distributing Statistical Estimation Using Random Projections

arXiv.org Machine Learning

We present DUAL-LOCO, a communication-efficient algorithm for distributed statistical estimation. DUAL-LOCO assumes that the data is distributed according to the features rather than the samples. It requires only a single round of communication, in which low-dimensional random projections are used to approximate the dependencies between features available to different workers. We show that DUAL-LOCO has bounded approximation error which depends only weakly on the number of workers. We compare DUAL-LOCO against a state-of-the-art distributed optimization method on a variety of real-world datasets and show that it obtains better speedups while retaining good accuracy.
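To make the single communication round concrete, here is a sketch of one worker's computation: it keeps its own raw feature block, receives only low-dimensional random projections of the other workers' blocks, and solves a local ridge problem on the concatenation. The paper's estimator works in the dual with structured projections; this plain-Gaussian, primal version is a simplification for illustration.

```python
import numpy as np

def random_projection(X_block, k, seed=0):
    """The low-dimensional summary each worker computes once and broadcasts
    (a plain Gaussian sketch here; structured projections are used in practice
    for speed)."""
    rng = np.random.default_rng(seed)
    p = X_block.shape[1]
    return X_block @ (rng.normal(size=(p, k)) / np.sqrt(k))

def local_ridge_worker(X_raw, X_other_proj, y, lam=1.0):
    """One worker's local solve: ridge regression on [own raw features,
    concatenated projections received from the other workers], keeping only
    the coefficients of its own raw features."""
    Xw = np.hstack([X_raw, X_other_proj])
    d = Xw.shape[1]
    beta = np.linalg.solve(Xw.T @ Xw + lam * np.eye(d), Xw.T @ y)
    return beta[: X_raw.shape[1]]
```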


Variance Reduced Stochastic Gradient Descent with Neighbors

Neural Information Processing Systems

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet it is also known to be slow relative to steepest descent. Recently, variance reduction techniques such as SVRG and SAGA have been proposed to overcome this weakness. With asymptotically vanishing variance, a constant step size can be maintained, resulting in geometric convergence rates. However, these methods are either based on occasional computations of full gradients at pivot points (SVRG), or on keeping per-data-point corrections in memory (SAGA). This has the disadvantage that one cannot employ these methods in a streaming setting and that speed-ups relative to SGD may need a certain number of epochs in order to materialize. This paper investigates a new class of algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points. While these algorithms are not meant to offer advantages in the asymptotic setting, they provide significant benefits in the transient optimization phase, in particular in a streaming or single-epoch setting. We investigate this family of algorithms in a thorough analysis and show supporting experimental results. As a side product, we provide a simple and unified proof technique for a broad class of variance reduction algorithms.


LOCO: Distributing Ridge Regression with Random Projections

arXiv.org Machine Learning

We propose LOCO, an algorithm for large-scale ridge regression which distributes the features across workers on a cluster. Important dependencies between variables are preserved using structured random projections, which are cheap to compute and need to be communicated only once. We show that LOCO obtains a solution which is close to the exact ridge regression solution in the fixed design setting. We verify this experimentally in a simulation study as well as in an application to climate prediction. Furthermore, we show that LOCO achieves significant speedups compared with a state-of-the-art distributed algorithm on a large-scale regression problem.
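For concreteness, the sketch below implements a generic subsampled randomized Hadamard transform, the kind of structured, cheap-to-compute random projection that keeps the communication small. The padding, scaling, and subsampling choices follow a textbook SRHT rather than LOCO's exact construction.

```python
import numpy as np

def srht(X, k, seed=0):
    """Subsampled randomized Hadamard transform of an n x p block X:
    random sign flips (D), a fast Walsh-Hadamard transform (H) applied to
    each row in O(n P log P) after zero-padding p to a power of two P,
    and uniform subsampling of k columns (S)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    P = 1 << (p - 1).bit_length()                        # next power of two
    Z = np.zeros((n, P))
    Z[:, :p] = X * rng.choice([-1.0, 1.0], size=p)       # random sign flips D
    h = 1
    while h < P:                                         # fast Walsh-Hadamard transform
        B = Z.reshape(n, -1, 2, h)
        Z = np.concatenate([B[:, :, 0] + B[:, :, 1],
                            B[:, :, 0] - B[:, :, 1]], axis=2).reshape(n, P)
        h *= 2
    cols = rng.choice(P, size=k, replace=False)          # uniform column subsampling S
    return Z[:, cols] * np.sqrt(P / k) / np.sqrt(P)

# Example: sketch a 5 x 10 block down to 4 columns.
print(srht(np.random.default_rng(1).normal(size=(5, 10)), k=4).shape)
```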