FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Neural Information Processing Systems 

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory.
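To make the IO-aware principle concrete, here is a minimal sketch in NumPy (not the paper's CUDA kernel; `tiled_attention` and `block_size` are illustrative names) of the tiling-plus-online-softmax idea: keys and values are processed one block at a time, so the full n x n attention matrix never has to be materialized in, or moved through, slow memory.

```python
# A minimal sketch of exact attention computed block-by-block with an
# online softmax, so only one (n, block_size) tile of scores exists at
# a time. On a GPU, the per-block working set is what would stay in SRAM.
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Returns softmax(Q @ K.T / sqrt(d)) @ V, computed without ever
    forming the full n x n score matrix. Q, K, V have shape (n, d)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)              # running output accumulator
    m = np.full(n, -np.inf)           # running row-wise max of scores
    l = np.zeros(n)                   # running softmax denominator
    for start in range(0, n, block_size):
        Kb = K[start:start + block_size]       # one key/value block
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                 # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))   # update running max
        p = np.exp(S - m_new[:, None])         # block softmax numerator
        correction = np.exp(m - m_new)         # rescale earlier partial sums
        l = l * correction + p.sum(axis=1)
        O = O * correction[:, None] + p @ Vb
        m = m_new
    return O / l[:, None]
```

For small inputs this matches the naive `softmax(Q @ K.T / np.sqrt(d)) @ V` up to floating-point error; the point of the rescaling trick is that partial results can be combined across blocks without revisiting earlier tiles, which is what lets the algorithm trade redundant recomputation for far fewer reads and writes.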
