Loki: Low-rank Keys for Efficient Sparse Attention

Neural Information Processing Systems 

In particular, the self-attention mechanism used in LLM inference contributes significantly to its compute and memory costs, which has sparked interest in approximating the self-attention computation to reduce these costs.
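The title suggests the approximation exploits low-rank structure in the key vectors. Below is a minimal illustrative sketch, not the authors' implementation, of one way such a scheme can work: learn a low-rank basis for the keys offline (here via PCA on calibration keys), score a query against the projected keys cheaply, keep only the top-k highest-scoring keys, and compute exact attention over that sparse set. All function names and the rank/top-k parameters are assumptions made for illustration.

```python
import numpy as np

def pca_basis(K_calib: np.ndarray, rank: int) -> np.ndarray:
    """Learn a low-rank basis for key vectors from calibration data.

    K_calib: (n_samples, d) key vectors collected offline.
    Returns: (d, rank) orthonormal projection matrix.
    """
    Kc = K_calib - K_calib.mean(axis=0, keepdims=True)
    # Right singular vectors span the principal subspace of the keys.
    _, _, Vt = np.linalg.svd(Kc, full_matrices=False)
    return Vt[:rank].T  # (d, rank)

def lowrank_sparse_attention(q, K, V, P, top_k):
    """Approximate single-query attention (illustrative sketch).

    q: (d,) query;  K: (n, d) keys;  V: (n, d_v) values
    P: (d, rank) low-rank basis;  top_k: number of keys to keep.
    """
    d = q.shape[-1]
    # 1. Cheap approximate scores in the low-rank subspace: (qP)(KP)^T.
    approx_scores = (q @ P) @ (K @ P).T / np.sqrt(d)   # (n,)
    # 2. Keep only the top_k keys by approximate score.
    idx = np.argpartition(approx_scores, -top_k)[-top_k:]
    # 3. Exact softmax attention over the selected sparse set.
    scores = (q @ K[idx].T) / np.sqrt(d)               # (top_k,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]                            # (d_v,)

# Example usage with random data (sizes are illustrative).
rng = np.random.default_rng(0)
n, d, d_v = 1024, 128, 128
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))
q = rng.standard_normal(d)
P = pca_basis(K, rank=32)   # in practice, learned on calibration keys
out = lowrank_sparse_attention(q, K, V, P, top_k=64)
print(out.shape)  # (128,)
```

The cost saving comes from step 1: scoring all n keys takes O(n * rank) instead of O(n * d), and the exact attention in step 3 touches only top_k keys and values.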
