Fast Transformers with Clustered Attention: Supplementary Material

Neural Information Processing Systems 

We first cluster the queries Q using K-means clustering to output S, which indicates the membership of queries to different clusters. The lower half of the figure shows the new values V̂ computed by sparse dot-products with the keys K and values V corresponding to the top-k keys in T.

Figure 6: We show training/validation loss convergence for different transformer variants. Both clustered variants converge significantly faster than lsh-1 and lsh-4. Note that due to a smaller batch size, full makes many more updates than all other transformer variants. In Figure 6a, we show the training loss convergence for different transformer variants.
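The procedure above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the naive Lloyd-iteration K-means, and all parameter names are assumptions made for clarity, and it assumes keys and values share the query dimension.

```python
import numpy as np

def improved_clustered_attention(Q, K, V, n_clusters=4, top_k=8, n_iters=10, seed=0):
    """Illustrative sketch: cluster queries with K-means, use the centroids
    to pick top-k keys per cluster, then take exact sparse dot-products
    with only those keys and values (hypothetical helper, not the paper's code)."""
    rng = np.random.default_rng(seed)
    N, d = Q.shape
    # --- K-means on the queries (naive Lloyd iterations) ---
    centroids = Q[rng.choice(N, n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        dists = ((Q[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, C)
        assign = dists.argmin(1)  # S: cluster membership of each query
        for c in range(n_clusters):
            members = Q[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    # --- centroid attention selects the top-k keys T per cluster ---
    cent_scores = centroids @ K.T / np.sqrt(d)           # (C, M)
    topk = np.argsort(-cent_scores, axis=1)[:, :top_k]   # (C, k)
    # --- exact sparse dot-products with the selected keys/values ---
    out = np.empty_like(Q)
    for i in range(N):
        T = topk[assign[i]]                      # top-k key indices for this query's cluster
        scores = Q[i] @ K[T].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ V[T]                  # V-hat for query i
    return out
```

Each query thus pays for only top_k dot-products instead of one per key, with the centroid attention amortized over all queries in a cluster.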
