Bayesian Attention Modules: Appendix A Algorithm

Neural Information Processing Systems 

Then softmax is applied to obtain probabilities. To tune the hyperparameters in BAM, we randomly hold out 20% of the training set for validation. The vocabulary size V is 9488 and the maximum caption length T is 16. During training, we use the MLE loss only, without scheduled sampling or an RL loss. At step j of decoding, the current LSTM state x (a function of the previous target words $y_{1:j-1}$) is used as the query.
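The two mechanical pieces of this setup, the random 20% validation hold-out and a softmax-normalized attention step driven by a decoder-state query, can be illustrated with a minimal NumPy sketch. This is not the implementation used in the paper: the function names (`holdout_split`, `attend`), the scaled dot-product scoring, and the array shapes are all assumptions made for illustration, and the sketch covers only deterministic softmax attention, not the stochastic attention weights that BAM introduces.

```python
import numpy as np


def holdout_split(num_examples, val_fraction=0.2, seed=0):
    """Randomly split example indices into (train, validation) index arrays.

    Hypothetical helper: the 20% fraction matches the text, the seed is arbitrary.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)
    num_val = int(round(val_fraction * num_examples))
    return perm[num_val:], perm[:num_val]


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def attend(query, keys, values):
    """One deterministic attention step.

    query:  (d,)     decoder (e.g. LSTM) state at the current step, used as the query
    keys:   (n, d)   one key per attended location
    values: (n, d_v) one value per attended location

    Scaled dot-product scoring is an assumption; the excerpt does not say
    which scoring function is used.
    """
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)       # attention probabilities over the n locations
    context = weights @ values      # weighted average of the values
    return context, weights


# Usage with made-up sizes: 5 attended locations, 8-dimensional query and keys.
rng = np.random.default_rng(1)
context, weights = attend(
    rng.normal(size=8), rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
)
train_idx, val_idx = holdout_split(100)
```

In the sketch, `weights` is nonnegative and sums to one over the attended locations, which is what applying softmax to the scores provides.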