A Method details

A.1 Categorical attention

Neural Information Processing Systems 

As described in Section 3.2, we implement categorical attention by associating each attention head [...]

In this example, an attention head (left) calculates the histogram for each position. This allows us to compress the corresponding function. Illustrative programs are depicted in Figures 8 and 9. This is illustrated in Figure 9.

In this section we describe additional implementation details for the experiments in Section 4. We train each model for 250 epochs with a batch size of 512, a learning rate of 0.05, and [...] We take one Gumbel sample per step.
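To make "one Gumbel sample per step" concrete, the following is a minimal sketch of drawing a single relaxed sample from a categorical distribution with Gumbel noise. It assumes a Gumbel-softmax style relaxation; the function name `gumbel_softmax_sample`, the example logits, and the temperature value are hypothetical and are not taken from the text above.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample (assumed Gumbel-softmax relaxation)."""
    # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u ~ Uniform.
    # The lower bound keeps log(u) finite.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    # Perturb the logits, then apply a temperature-scaled, stable softmax.
    scores = (logits + gumbel_noise) / temperature
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])  # hypothetical logits over three categories
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)  # one sample
```

Under these assumptions, a training loop would call the function once per optimization step for each categorical parameter. Lower temperatures push the sample closer to a one-hot vector.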
