Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

Open in new window