
A Supplementary Analysis

Neural Information Processing Systems

To evaluate TSLD's efficiency, we detail training speeds and GPU memory consumption for various models. Our analysis of confidence disparity in token predictions, detailed in Section 4.2, extends beyond a single model; in fact, this observed trend is consistently present across various GLM models. These errors are visualized using a heatmap plot (Fig. A2, top). For the OPT-6.7B model, quantization error is measured for the 5th and 15th layers; for the LLaMA-7B model, quantization errors are depicted for input sequence lengths of 128 and 512. From left to right: OPT-6.7B, LLaMA-7B, and LLaMA-2-7B. However, as we delve deeper into the layers of OPT-6.7B or introduce longer input sequences to LLaMA-7B, this phenomenon becomes less pronounced.





More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

Meir, Sagi, Keidar, Tommer D., Levi, Noam, Reuveni, Shlomi, Hirshberg, Barak

arXiv.org Machine Learning

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior of pass@k leads to sublinear growth of coverage@cost (diminishing returns). To address this, we propose Reset-and-Discard (ReD), a method for querying LLMs that increases coverage@cost at any given budget, regardless of the form of pass@k. Moreover, given a pass@k curve, we can quantitatively predict the savings in the total number of attempts when using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the attempts, tokens, and USD cost required to reach a desired coverage, while also offering an efficient way to measure inference power laws.
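The two metrics in the abstract can be made concrete with a short sketch. The `pass_at_k` estimator below is the standard unbiased one from the HumanEval literature; `coverage_at_cost` is an illustrative model (not the paper's exact formulation) that assumes a power-law failure rate `k**(-alpha)` per question, matching the "empirically observed power-law behavior" the abstract mentions. The function names and the value of `alpha` are assumptions for illustration only.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n attempts with c correct,
    is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def coverage_at_cost(budget: int, num_questions: int, alpha: float = 0.5) -> float:
    """Expected number of unique questions solved when the budget is split
    evenly and pass@k follows an assumed power law 1 - k**(-alpha)."""
    k = budget / num_questions        # attempts allotted per question
    pass_k = 1.0 - k ** (-alpha)      # assumed power-law pass@k
    return num_questions * pass_k     # grows sublinearly in the budget
```

With `alpha = 0.5`, quadrupling the budget from 100 to 400 attempts over 100 questions raises expected coverage from 0 to 50, and each further quadrupling closes only half of the remaining gap, which is the diminishing-returns behavior the paper sets out to mitigate.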


Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

Neural Information Processing Systems

We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the \textit{almost minimizer}\footnote{Following \citet{raginsky2017non}, an almost minimizer is defined to be a point which is within the ball of the global minimizer with radius $O(d\log(\beta+1)/\beta)$, where $d$ is the problem dimension and $\beta$ is the inverse temperature parameter.}
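A minimal sketch of the GLD iteration the abstract analyzes, on a toy double-well objective. The step rule `x - eta * grad_f(x) + sqrt(2*eta/beta) * noise` is the standard Euler discretization of Langevin dynamics; the objective, step size, and temperature here are illustrative choices, not the paper's setup.

```python
import numpy as np

def gld_step(x, grad_f, eta, beta, rng):
    """One gradient Langevin dynamics step:
    x_{t+1} = x_t - eta * grad f(x_t) + sqrt(2*eta/beta) * N(0, I)."""
    noise = rng.standard_normal(x.shape)
    return x - eta * grad_f(x) + np.sqrt(2.0 * eta / beta) * noise

# Toy nonconvex objective: f(x) = (x^2 - 1)^2, minimized at x = +/- 1.
grad = lambda x: 4.0 * x * (x ** 2 - 1.0)

rng = np.random.default_rng(0)
x = np.array([2.0])
for _ in range(5000):
    x = gld_step(x, grad, eta=1e-3, beta=100.0, rng=rng)
# At large inverse temperature beta, the stationary distribution
# exp(-beta * f) concentrates near the global minimizers, so the
# iterate ends up close to |x| = 1 (an "almost minimizer").
```

SGLD replaces `grad_f` with a stochastic gradient over a minibatch of the $n$ component functions; the injected Gaussian noise is what lets the iterate escape local minima, at the cost of the $O(d\log(\beta+1)/\beta)$ radius around the global minimizer quoted in the footnote.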