Rankmax: An Adaptive Projection Alternative to the Softmax Function Supplementary Material

Neural Information Processing Systems

This document consists of results that support the material in the paper "Rankmax: An Adaptive Projection Alternative to the Softmax Function", hereafter referred to as the main paper.


A Regularized Framework for Sparse and Structured Neural Attention

Neural Information Processing Systems

Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max operator. We show that the gradient of this operator defines a mapping from real values to probabilities, suitable as an attention mechanism. Our framework includes softmax and a slight generalization of the recently proposed sparsemax as special cases. We also show how our framework can incorporate modern structured penalties, resulting in more interpretable attention mechanisms that focus on entire segments or groups of an input. We derive efficient algorithms to compute the forward and backward passes of our attention mechanisms, enabling their use in a neural network trained with backpropagation. To showcase their potential as drop-in replacements for existing mechanisms, we evaluate them on three large-scale tasks: textual entailment, machine translation, and sentence summarization. Our attention mechanisms improve interpretability without sacrificing performance; notably, on textual entailment and summarization, we outperform the standard attention mechanisms based on softmax and sparsemax.
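To make the two special cases concrete, here is a minimal NumPy sketch, written for this summary rather than taken from the paper's code: softmax is the gradient of the log-sum-exp smoothed max, while sparsemax is the Euclidean projection onto the probability simplex, computed with the standard sorting-based threshold search of Martins and Astudillo (2016). The variable names are illustrative.

```python
import numpy as np

def softmax(z):
    """Dense probabilities: gradient of the log-sum-exp smoothed max."""
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Gradient of the squared-norm smoothed max; can return exact zeros
    (sorting-based threshold search of Martins & Astudillo, 2016).
    """
    z_sorted = np.sort(z)[::-1]                # scores in decreasing order
    cssv = np.cumsum(z_sorted) - 1.0           # cumulative sums, shifted by 1
    k = np.arange(1, z.size + 1)
    support = z_sorted - cssv / k > 0          # coordinates kept in the support
    tau = cssv[support][-1] / k[support][-1]   # threshold
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.0, 0.1, -1.0])
print(softmax(scores))     # every entry strictly positive
print(sparsemax(scores))   # low-scoring entries exactly zero: [1., 0., 0., 0.]
```

On the sample scores, softmax assigns nonzero mass everywhere, while sparsemax zeroes out the low-scoring entries; that sparsity is what makes the resulting attention maps easier to interpret.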




Supplemental Material: A. Differential Negentropy and Boltzmann-Gibbs Distributions

Neural Information Processing Systems

"An important question is then whether in the modification the normalization should stand in front of the deformed exponential function, or whether it should be included as " Throughout our paper, we use the definition of [10, 25], equivalent to the maxent problem (27). Since each slice of a paraboloid is an ellipsoid, we can apply Cavalieri's principle to obtain the volume of a paraboloid N (t; 0, 1) = 1 2 null erf null v 2 null erf null u 2 nullnull v N (v; 0, 1) + uN ( u; 0, 1), (50) from which the expectation (49) can be computed directly. We start with the following lemma:Lemma 1. Applying Fubini's theorem, we fix The training and test sets are perfectly balanced: 12.5K negative and The documents have 280 words on average. Figure 4 illustrates the difficulties that continuous attention models may face when trying to focus on objects that are too far from each other or that seem to have different relative importance to answer the question. Batch size 64 Word embeddings size 300 Input image features size 2048 Input question features size 512 Fused multimodal features size 1024 Multi-head attention hidden size 512 Number of MCA layers 6 Number of attention heads 8 Dropout rate 0.1 MLP size in flatten layers 512 Optimizer Adam Base learning rate at epoch t starting from 1 min(2.