A Regularized Framework for Sparse and Structured Neural Attention
Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max operator. We show that the gradient of this operator defines a mapping from real values to probabilities, suitable as an attention mechanism. Our framework includes softmax and a slight generalization of the recently proposed sparsemax as special cases. However, we also show how our framework can incorporate modern structured penalties, resulting in more interpretable attention mechanisms that focus on entire segments or groups of an input. We derive efficient algorithms to compute the forward and backward passes of our attention mechanisms, enabling their use in a neural network trained with backpropagation. To showcase their potential as a drop-in replacement for existing ones, we evaluate our attention mechanisms on three large-scale tasks: textual entailment, machine translation, and sentence summarization. Our attention mechanisms improve interpretability without sacrificing performance; notably, on textual entailment and summarization, we outperform the standard attention mechanisms based on softmax and sparsemax.
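The two special cases named in the abstract, softmax and sparsemax, are both mappings from real scores to a probability distribution; sparsemax is the Euclidean projection onto the probability simplex and can return exact zeros. As a concrete reference point, here is a minimal NumPy sketch of the two mappings (the function names and implementation details are ours, not the paper's):

```python
import numpy as np

def softmax(z):
    # Standard softmax with max-subtraction for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex
    # (Martins & Astudillo, 2016). Sorts scores, finds the support
    # size, and thresholds: p_i = max(z_i - tau, 0).
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum
    k_max = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

# e.g. sparsemax(np.array([1.5, 1.0, 0.2])) -> [0.75, 0.25, 0.0]
# while softmax assigns nonzero mass to every entry.
```

Unlike softmax, sparsemax zeroes out low-scoring entries entirely, which is the source of the interpretability the abstract refers to.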
Supplemental Material

A. Differential Negentropy and Boltzmann-Gibbs Distributions

We adapt a proof from Cover and Thomas.

"An important question is then whether in the modification the normalization should stand in front of the deformed exponential function, or whether it should be included as "

Throughout our paper, we use the definition of [10, 25], equivalent to the maxent problem (27).

Since each slice of a paraboloid is an ellipsoid, we can apply Cavalieri's principle to obtain the volume of a paraboloid. We have

$$\int_u^v t^2\, N(t; 0, 1)\, dt = \frac{1}{2}\left[\operatorname{erf}\!\left(\tfrac{v}{\sqrt{2}}\right) - \operatorname{erf}\!\left(\tfrac{u}{\sqrt{2}}\right)\right] - v\, N(v; 0, 1) + u\, N(u; 0, 1), \qquad (50)$$

from which the expectation (49) can be computed directly.

We start with the following lemma: Lemma 1. Applying Fubini's theorem, we fix

The training and test sets are perfectly balanced: 12.5K negative and 12.5K positive examples. The documents have 280 words on average.

Figure 4 illustrates the difficulties that continuous attention models may face when trying to focus on objects that are too far from each other or that seem to have different relative importance to answer the question.

Batch size: 64
Word embedding size: 300
Input image feature size: 2048
Input question feature size: 512
Fused multimodal feature size: 1024
Multi-head attention hidden size: 512
Number of MCA layers: 6
Number of attention heads: 8
Dropout rate: 0.1
MLP size in flatten layers: 512
Optimizer: Adam
Base learning rate at epoch t starting from 1: min(2.
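The truncated-Gaussian identity labeled (50), which expresses $\int_u^v t^2 N(t;0,1)\,dt$ in closed form via integration by parts, can be checked numerically. A minimal sketch comparing the closed form against quadrature (the SciPy usage and function names here are ours, for verification only):

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def phi(t):
    # Standard Gaussian density N(t; 0, 1).
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def second_moment_closed_form(u, v):
    # Right-hand side of eq. (50):
    # 1/2 [erf(v/sqrt(2)) - erf(u/sqrt(2))] - v*phi(v) + u*phi(u)
    return (0.5 * (erf(v / np.sqrt(2)) - erf(u / np.sqrt(2)))
            - v * phi(v) + u * phi(u))

u, v = -0.7, 1.3
numeric, _ = quad(lambda t: t**2 * phi(t), u, v)
closed = second_moment_closed_form(u, v)
```

The two values agree to quadrature precision, confirming the sign pattern of the boundary terms $-v\,N(v;0,1) + u\,N(u;0,1)$.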