Goto

Collaborating Authors

 speedup


ZipLM: Inference-Aware Structured Pruning of Language Models

Neural Information Processing Systems

The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the post-training/one-shot or the gradual compression setting, and only for specific families of models such as BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERTbase distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Moreover, it matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERTlarge model. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster.





An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning

Neural Information Processing Systems

We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. This improves over the optimal method of Lan [17], which is insensitive to the minimum expected loss; over the optimistic acceleration of Cotter et al. [10], which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin [19], which is limited to least squares problems and is also similarly suboptimal with respect to the minibatch size.


Supplementary Material Dynamic Results a)b)c)d)e)f)g)

Neural Information Processing Systems

The different cases represent various material property configurations. In Figure 8 we show the temporal evolution of different scenarios (a) to (d) for the initially straight bending rod, and (e) to (f) for the elastic helix. The default parameters of the initially straight bending rod are 0 =0, N = 30, ` =4 .0 In (b), we modify N 2{ 10,20,40,60}. The default parameters of the elastic helix are HR =0 .5 m (helix radius), HH =0 .5 m (helix height), HW =2 .5 (winding number), T =1 .0



Parallel Sampling of Diffusion Models

Neural Information Processing Systems

Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.1


Circa: Stochastic ReLUs for Private Deep Learning

Neural Information Processing Systems

The simultaneous rise of machine learning as a service and concerns over user privacy have increasingly motivated the need for private inference (PI). While recent work demonstrates PI is possible using cryptographic primitives, the computational overheads render it impractical. State-of-art deep networks are inadequate in this context because the source of slowdown in PI stems from the ReLU operations whereas optimizations for plaintext inference focus on reducing FLOPs. In this paper we re-think ReLU computations and propose optimizations for PI tailored to properties of neural networks. Specifically, we reformulate ReLU as an approximate sign test and introduce a novel truncation method for the sign test that significantly reduces the cost per ReLU. These optimizations result in a specific type of stochastic ReLU. The key observation is that the stochastic fault behavior is well suited for the fault-tolerant properties of neural network inference. Thus, we provide significant savings without impacting accuracy. We collectively call the optimizations Circa and demonstrate improvements of up to 4.7 storage and 3 runtime over baseline implementations; we further show that Circa can be used on top of recent PI optimizations to obtain 1.8 additional speedup.


Quantum Speedups of Optimizing Approximately Convex Functions with Applications to Logarithmic Regret Stochastic Convex Bandits

Neural Information Processing Systems

We initiate the study of quantum algorithms for optimizing approximately convex functions. Given a convex set K Rn and a function F: Rn Rsuch that there exists a convex function f: K R satisfying supx K|F(x) f(x)| /n, our quantum algorithm finds an x K such that F(x) minx KF(x) using O(n3) quantum evaluation queries to F. This achieves a polynomial quantum speedup compared to the best-known classical algorithms. As an application, we give a quantum algorithm for zeroth-order stochastic convex bandits with O(n5 log2 T) regret, an exponential speedup in T compared to the classical Ω( T) lower bound. Technically, we achieve quantum speedup in nby exploiting a quantum framework of simulated annealing and adopting a quantum version of the hit-and-run walk. Our speedup in T for zeroth-order stochastic convex bandits is due to a quadratic quantum speedup in multiplicative error of mean estimation.