Overview
Supplementary Material
In this supplementary, we first provide an overview of our proof techniques in Appendix A and then
Our analysis of the generalization error is based on an extension of Gordon's Gaussian process inequality, the convex Gaussian min-max theorem (CGMT) [...] R is a continuous function that is convex in its first argument and concave in its second. The main result of the CGMT connects the above two random optimization problems. The CGMT framework has been used to infer statistical properties of estimators in certain high-dimensional asymptotic regimes. Second, derive the pointwise limit of the auxiliary optimization (AO) objective in terms of a convex-concave optimization problem over only a few scalar variables.
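For orientation, the pair of random optimization problems that the CGMT relates is usually written as a primary optimization (PO) and an auxiliary optimization (AO). The block below is a sketch of that standard statement as it commonly appears in the CGMT literature, not a reproduction of the authors' own equations. The symbols $G$, $g$, $h$, $S_w$, $S_u$ and $\psi$ are assumed notation, and some references write the $h$ term with a minus sign, which is equivalent because $h$ is symmetric in distribution.

```latex
% Assumed notation: G \in R^{m x n}, g \in R^m, h \in R^n have i.i.d. N(0,1) entries;
% S_w \subset R^n and S_u \subset R^m are compact sets;
% \psi : R^n \times R^m \to R is continuous.
\begin{align}
  \Phi(G)   &= \min_{w \in S_w} \max_{u \in S_u} \; u^\top G w + \psi(w, u)
            && \text{(PO)} \\
  \phi(g,h) &= \min_{w \in S_w} \max_{u \in S_u} \; \|w\|_2 \, g^\top u + \|u\|_2 \, h^\top w + \psi(w, u)
            && \text{(AO)}
\end{align}
% Gordon-type comparison, valid for any c \in R:
\begin{equation}
  \mathbb{P}\bigl(\Phi(G) < c\bigr) \le 2\, \mathbb{P}\bigl(\phi(g,h) \le c\bigr).
\end{equation}
% If, in addition, S_w and S_u are convex and \psi is convex-concave, then
\begin{equation}
  \mathbb{P}\bigl(\Phi(G) > c\bigr) \le 2\, \mathbb{P}\bigl(\phi(g,h) \ge c\bigr),
\end{equation}
% so concentration of the AO value around a constant transfers to the PO value.
```

The practical benefit is that the AO involves only the vectors $g$ and $h$ rather than the full matrix $G$, which is what makes a reduction to a few scalar variables feasible.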
Block Transformer: Global-to-Local Language Modeling for Fast Inference
We introduce the Block Transformer, which applies hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of all previous tokens to be retrieved from memory at every decoding step to obtain context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the entire prompt must first be processed to prefill the KV cache.
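To give a sense of scale for the KV cache that must be read at each decoding step, the following is a minimal back-of-the-envelope sketch. It assumes a vanilla transformer with one key and one value tensor per layer. The function name `kv_cache_bytes` and the model shape used here are hypothetical illustrations, not figures from the paper.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes occupied by cached keys and values for all previous positions.

    Assumes a standard transformer decoder; bytes_per_elem=2 corresponds
    to 16-bit storage (an assumption, not a value from the source).
    """
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=24, num_kv_heads=16, head_dim=64)

total = kv_cache_bytes(batch_size=64, seq_len=2048, **cfg)
print(f"KV cache read per decoding step: {total / 2**30:.1f} GiB")

# The cache grows linearly with context length, so doubling the
# context doubles the memory traffic at every decoding step.
doubled = kv_cache_bytes(batch_size=64, seq_len=4096, **cfg)
print(f"Ratio after doubling context: {doubled / total:.1f}x")
```

Because every term enters multiplicatively, the cost grows linearly with both batch size and context length, which is why the retrieval described above becomes a bottleneck in batch inference.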