Block Transformer: Global-to-Local Language Modeling for Fast Inference

Feb-13-2026, 20:40:33 GMT–Neural Information Processing Systems

We introduce the Block Transformer, which applies hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of all previous tokens to be fetched from memory at every decoding step to obtain context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the entire prompt must be processed to prefill the KV cache before decoding can begin.
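
The memory traffic described above can be made concrete with a rough back-of-the-envelope sketch. This is not taken from the paper: the function names and the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) are hypothetical and chosen only for illustration, and the estimate ignores weights, activations, and any cache compression.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # A key and a value vector (factor 2) are cached per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_bytes_read_at_step(context_len, per_token_bytes, batch_size=1):
    # One decoding step attends over every cached position of each sequence,
    # so the read volume scales with the current context length.
    return batch_size * context_len * per_token_bytes


def total_kv_bytes_read(prompt_len, new_tokens, per_token_bytes, batch_size=1):
    # The cache grows by one token per step; summing the per-step reads
    # gives a total that grows roughly quadratically with sequence length.
    return sum(
        kv_bytes_read_at_step(prompt_len + i, per_token_bytes, batch_size)
        for i in range(new_tokens)
    )


# Hypothetical model shape (an assumption, not the paper's configuration).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
step_read = kv_bytes_read_at_step(context_len=2048, per_token_bytes=per_token)

print(f"KV cache per token: {per_token / 2**10:.0f} KiB")
print(f"KV read for one step at 2048 tokens: {step_read / 2**30:.0f} GiB")
```

Under these assumed dimensions, each cached token occupies 512 KiB and a single decoding step at a 2048-token context reads about 1 GiB of cache per sequence, which is why the cache fetch rather than arithmetic tends to dominate decoding cost.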

decoder, large language model, machine learning

Genre:
- Overview (0.92)
- Research Report
  - New Finding (1.00)
  - Experimental Study (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
