Block Transformer: Global-to-Local Language Modeling for Fast Inference

Neural Information Processing Systems

We introduce the Block Transformer, which applies hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. To obtain context information, self-attention requires the key-value (KV) cache of all previous sequences to be retrieved from memory at every decoding step, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the entire prompt must be processed to prefill the KV cache.


$\mathrm{dist}(x, y)$ and $\mathrm{avg}(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} \mathrm{dist}(a, b)$

Neural Information Processing Systems

In this paper, we present a comprehensive study of the performance of average-link in metric spaces, regarding several natural criteria that capture separability and cohesion, and are more interpretable than Dasgupta's cost function and its variants.
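
For readers unfamiliar with average-link, the following is a minimal sketch of the linkage and the greedy merging it drives. It is not taken from the paper. It assumes Euclidean distance as the metric, and the function names `avg_link` and `average_link_clustering` and the stopping parameter `k` are hypothetical choices for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance (assumed metric), Python 3.8+


def avg_link(A, B):
    """Average pairwise distance between two non-empty clusters A and B."""
    if not A or not B:
        raise ValueError("clusters must be non-empty")
    total = sum(dist(a, b) for a, b in product(A, B))
    return total / (len(A) * len(B))


def average_link_clustering(points, k):
    """Greedy agglomerative clustering: start from singletons and repeatedly
    merge the pair of clusters with the smallest average distance until
    only k clusters remain (k >= 1 is assumed)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of cluster indices (i < j) with minimal avg_link.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling `average_link_clustering([(0, 0), (0, 1), (10, 0), (10, 1)], 2)` groups the two points near the origin together and the two points near x = 10 together. This naive version recomputes every linkage at each step, so it is meant only to make the definition concrete, not to be efficient.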