Budzinskiy, Stanislav
Numerical Error Analysis of Large Language Models
Budzinskiy, Stanislav, Fang, Wenyi, Zeng, Longbin, Petersen, Philipp
Large language models based on transformer architectures have become integral to state-of-the-art natural language processing applications. However, their training remains computationally expensive and exhibits instabilities, some of which are expected to be caused by finite-precision computations. We provide a theoretical analysis of the impact of round-off errors within the forward pass of a transformer architecture, which yields fundamental bounds for these effects. In addition, we conduct a series of numerical experiments that demonstrate the practical relevance of our bounds. Our results yield concrete guidelines for choosing hyperparameters that mitigate round-off errors, leading to more robust and stable inference.
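A minimal sketch of the kind of effect studied here, not the paper's analysis: a toy scaled dot-product attention block evaluated once in half precision and once in double precision, so the entrywise discrepancy between the two outputs is attributable to round-off. The dimensions, the Gaussian inputs, and the max-shift stabilization are assumptions made purely for illustration.

```python
# Illustrative only: one attention block in float16 vs float64.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 128                      # head dimension and sequence length (arbitrary)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

def attention(Q, K, V, dtype):
    Q, K, V = (M.astype(dtype) for M in (Q, K, V))
    scores = Q @ K.T / np.sqrt(np.array(d, dtype=dtype))
    scores -= scores.max(axis=1, keepdims=True)   # standard softmax stabilization
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

out_lo = attention(Q, K, V, np.float16)
out_hi = attention(Q, K, V, np.float64)
print("max entrywise deviation:", np.abs(out_lo - out_hi).max())
```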
When big data actually are low-rank, or entrywise approximation of certain function-generated matrices
Budzinskiy, Stanislav
The article concerns low-rank approximation of matrices generated by sampling a smooth function of two $m$-dimensional variables. We refute an argument made in the literature that, for a specific class of analytic functions, such matrices admit accurate entrywise approximation of rank that is independent of $m$. We provide a theoretical explanation of the numerical results presented in support of this argument, describing three narrower classes of functions for which $n \times n$ function-generated matrices can be approximated within an entrywise error of order $\varepsilon$ with rank $\mathcal{O}(\log(n) \varepsilon^{-2} \mathrm{polylog}(\varepsilon^{-1}))$ that is independent of the dimension $m$: (i) functions of the inner product of the two variables, (ii) functions of the squared Euclidean distance between the variables, and (iii) shift-invariant positive-definite kernels. We extend our argument to low-rank tensor-train approximation of tensors generated with functions of the multi-linear product of their $m$-dimensional variables. We discuss our results in the context of low-rank approximation of attention in transformer neural networks.
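A small illustrative sketch of the objects discussed above, not a reproduction of the paper's constructions: an $n \times n$ matrix generated by a function of the inner product of $m$-dimensional samples, with the entrywise error of its truncated SVD reported for a few ranks. The choice $f = \exp$, the Gaussian sampling, and the use of the SVD (which is optimal in the spectral and Frobenius norms, not entrywise) are assumptions for illustration only.

```python
# Illustrative only: entrywise error of low-rank approximations of f(<x_i, y_j>).
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 200                       # matrix size and variable dimension (arbitrary)
X = rng.standard_normal((n, m)) / np.sqrt(m)
Y = rng.standard_normal((n, m)) / np.sqrt(m)

A = np.exp(X @ Y.T)                   # matrix generated by a function of the inner product

U, s, Vt = np.linalg.svd(A, full_matrices=False)
for r in (2, 4, 8, 16):
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]
    print(r, np.abs(A - A_r).max())   # entrywise (Chebyshev) error of the rank-r truncation
```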
Tensor train completion: local recovery guarantees via Riemannian optimization
Budzinskiy, Stanislav, Zamarashkin, Nikolai
The problem of recovering algebraically structured data from scarce measurements has already become a classic one. The data under consideration are typically sparse vectors or low-rank matrices and tensors, while the measurements are obtained by applying a linear operator that satisfies a variant of the so-called restricted isometry property (RIP) [1]. In this work, we focus on tensor completion, which consists in recovering a tensor in the tensor train (TT) format [2, 3] from a small subset of its entries. Specifically, we consider it as a Riemannian optimization problem [4, 5] on the smooth manifold of tensors with fixed TT ranks and derive sufficient conditions (essentially, the RIP) for local convergence of the Riemannian gradient descent. We further estimate the number of randomly selected entries of a tensor with low TT ranks that is sufficient for the RIP to hold with high probability and, as a consequence, for the Riemannian gradient descent to converge locally.
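The following sketch sets up only the measurement model described above, under assumed sizes and ranks: a third-order tensor with low TT ranks and a random subset of observed entries. The Riemannian gradient descent on the fixed-TT-rank manifold, which is the subject of the convergence analysis, is not reproduced here.

```python
# Illustrative only: a low-TT-rank tensor and a random sampling of its entries.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 20, 20, 20
r1, r2 = 3, 3                                   # TT ranks (arbitrary)

# TT cores G1 (n1 x r1), G2 (r1 x n2 x r2), G3 (r2 x n3)
G1 = rng.standard_normal((n1, r1))
G2 = rng.standard_normal((r1, n2, r2))
G3 = rng.standard_normal((r2, n3))
T = np.einsum('ia,ajb,bk->ijk', G1, G2, G3)     # full tensor with TT ranks (r1, r2)

# Observe a small random subset of entries (the completion measurements)
p = 0.2
mask = rng.random(T.shape) < p
observed = T[mask]
print("observed fraction:", mask.mean(), "observed entries:", observed.size)
```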
Variational Bayesian inference for CP tensor completion with side information
Budzinskiy, Stanislav, Zamarashkin, Nikolai
We propose a message passing algorithm, based on variational Bayesian inference, for low-rank tensor completion with automatic rank determination in the canonical polyadic format when additional side information (SI) is given. The SI comes in the form of low-dimensional subspaces that contain the fiber spans of the tensor (columns, rows, tubes, etc.). We validate the regularization properties induced by SI with extensive numerical experiments on synthetic and real-world data and present results on tensor recovery and rank determination. The results show that the number of samples required for successful completion is significantly reduced in the presence of SI. We also discuss the origin of a bump in the phase transition curves that exists when the dimensionality of SI is comparable with that of the tensor.
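A minimal sketch of the model with side information, under assumed sizes and ranks: a CP tensor whose factor matrices lie in known low-dimensional subspaces, which is the form of SI described in the abstract. The variational Bayesian message passing solver itself is not shown, and all names and dimensions below are placeholders.

```python
# Illustrative only: a CP tensor with factors constrained to SI subspaces, U_k = A_k Z_k.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 30, 30, 30
R = 4                                            # CP rank (arbitrary)
d1, d2, d3 = 10, 10, 10                          # dimensions of the SI subspaces

# Side information: orthonormal bases of subspaces containing the factor columns
A1, _ = np.linalg.qr(rng.standard_normal((n1, d1)))
A2, _ = np.linalg.qr(rng.standard_normal((n2, d2)))
A3, _ = np.linalg.qr(rng.standard_normal((n3, d3)))

# Factor matrices expressed through the subspaces
U1 = A1 @ rng.standard_normal((d1, R))
U2 = A2 @ rng.standard_normal((d2, R))
U3 = A3 @ rng.standard_normal((d3, R))

T = np.einsum('ir,jr,kr->ijk', U1, U2, U3)       # rank-R CP tensor
mask = rng.random(T.shape) < 0.1                 # observed entries for completion
print("CP rank:", R, "observed fraction:", mask.mean())
```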