Attention that does not Explain Away
Nan Ding, Xinjie Fan, Zhenzhong Lan, Dale Schuurmans, Radu Soricut
Transformer-based models have achieved strong performance in a variety of machine learning tasks, such as machine translation (Vaswani et al., 2017; Dehghani et al., 2019), language modeling (Devlin et al., 2019; Yang et al., 2019), summarization (Cohan et al., 2018; Goodman et al., 2019), dialog (Mazaré et al., 2018; Cheng et al., 2019), image captioning (Sharma et al., 2018; Zhao et al., 2019), and visual question answering (Yu et al., 2019b; Tan and Bansal, 2019). One of the most important components of the Transformer architecture is its self-attention mechanism, applied universally to both the encoder and the decoder components.

Under a Gaussian mixture model (GMM) view of attention, not all Gaussian centers (lower-layer neurons) are required to contribute to generating the output data (upper-layer neurons). The information of the centers that do not generate data is lost after observing the data. This "explaining-away" effect is related to the one in directed graphical models, in the sense that the few contributing lower neurons "explain away" the other, muted lower neurons in generating the upper neurons. In order to compensate for this, we describe a doubly-normalized attention scheme.
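The excerpt leans on this GMM reading without spelling it out. One standard way to make it concrete (the notation here is ours; we assume unit-variance components and uniform mixing weights) treats the upper-layer neurons as data points $x_i$ and the lower-layer neurons as centers $\mu_j$; the posterior responsibility of center $j$ for $x_i$ is then

\[ p(z_i = j \mid x_i) = \frac{\exp\!\left(-\tfrac{1}{2}\lVert x_i - \mu_j \rVert^2\right)}{\sum_{j'} \exp\!\left(-\tfrac{1}{2}\lVert x_i - \mu_{j'} \rVert^2\right)} = \frac{\exp\!\left(x_i^\top \mu_j - \tfrac{1}{2}\lVert \mu_j \rVert^2\right)}{\sum_{j'} \exp\!\left(x_i^\top \mu_{j'} - \tfrac{1}{2}\lVert \mu_{j'} \rVert^2\right)}, \]

which has the same softmax-over-keys form as attention with query $x_i$ and keys $\mu_j$. Because the responsibilities for each $x_i$ sum to one over the centers, a few high-scoring centers can absorb nearly all of the mass, leaving the remaining centers with almost no influence on any output.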
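To see the effect numerically, here is a minimal numpy sketch (the sizes, names, and the +6 score offset are illustrative, not from the paper). It builds attention logits in which one key uniformly dominates, shows that the standard per-query softmax lets that key absorb almost all of the attention mass, and contrasts a doubly-normalized variant that normalizes over queries before keys; this is only one plausible realization of double normalization, and the paper's exact formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_keys = 4, 6

# Illustrative attention logits: key 0 scores uniformly higher,
# like a Gaussian center that sits close to every data point.
logits = rng.normal(size=(n_queries, n_keys))
logits[:, 0] += 6.0

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Standard attention: softmax over keys, separately for each query.
attn = softmax(logits, axis=1)
# Total mass each key receives across all queries: key 0 takes
# nearly all 4 units; the other keys are "explained away".
print(np.round(attn.sum(axis=0), 3))

# Doubly-normalized sketch: normalize over queries first, then over keys.
over_queries = softmax(logits, axis=0)
dn_attn = over_queries / over_queries.sum(axis=1, keepdims=True)
# Every key now retains a nontrivial share of the total mass.
print(np.round(dn_attn.sum(axis=0), 3))
```

With the standard softmax, the per-key mass concentrates almost entirely on the dominant key; after double normalization, every key keeps a meaningful share, which is the behavior the paragraph above motivates.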
Sep-29-2020