To enable wider use of this powerful deep learning architecture, we propose two new methods. The first, adaptive attention span, is a way to make Transformer networks more efficient for longer sentences. With this method, we were able to increase the attention span of a Transformer to over 8,000 tokens without significantly increasing computation time or memory footprint. The second, the all-attention layer, is a way to simplify the model architecture of Transformer networks. Even with a much simpler architecture, our all-attention network matched the state-of-the-art performance of Transformer networks.
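The passage above does not say how the attention span is adapted. As a rough sketch, one formulation is a soft mask that decays linearly to zero beyond a learnable span, applied to the attention weights before normalization. The function names, the `ramp` parameter, and the toy scores below are illustrative assumptions, not the authors' code.

```python
import math

def soft_span_mask(distance, span, ramp):
    """Soft mask in [0, 1] for a key that lies `distance` tokens in the past.

    Assumed form: 1 inside the span, a linear ramp of width `ramp` past it,
    and 0 beyond. `span` would be a learned parameter in a real model.
    """
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)

def masked_attention_weights(scores, span, ramp):
    """Softmax over raw scores, reweighted by the soft span mask.

    scores[i] is the raw attention score of the key i tokens in the past.
    Assumes span >= 0 so that at least the nearest key keeps a nonzero mask.
    """
    peak = max(scores)  # subtract the max for numerical stability
    weighted = [
        soft_span_mask(i, span, ramp) * math.exp(s - peak)
        for i, s in enumerate(scores)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]

# Eight past tokens with equal raw scores; only the nearest few keep weight.
weights = masked_attention_weights([0.0] * 8, span=2.0, ramp=2.0)
```

Keys whose mask is exactly zero need not be computed or stored at all, which is one plausible reading of how a long maximum span can stay cheap when most heads learn short spans.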
Convolutional neural networks have proven to be a powerful tool for image recognition, allowing for ever-improving results in image classification (ImageNet), object detection (COCO), and other tasks. Despite their success, convolutions are limited by their locality, i.e. their inability to consider relations between distant areas of an image. Self-attention, by contrast, is a popular mechanism that has proven successful in overcoming locality and has been shown to capture long-range interactions. In a recent paper, Attention Augmented Convolutional Networks (AACN), a team from Google Brain presents a new way to add self-attention to common computer vision algorithms. By combining convolutional layers and self-attention layers in a ResNet architecture, the researchers were able to achieve top results in image classification and object detection while requiring a smaller model than non-attention ResNet models.
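To make the idea of combining the two layer types concrete, here is a minimal sketch that assumes the convolutional and self-attention outputs are concatenated channel-wise at each pixel position, which the summary above does not spell out. It uses a single attention head with identity projections and no positional encoding. The helper names and the toy 2x2 input are hypothetical, and `conv_out` is a stand-in for a real convolution's output.

```python
import math

def self_attention(features):
    """Single-head self-attention over positions, with identity projections.

    features: list of per-position vectors (an image flattened to H*W rows).
    A real layer would use learned query/key/value projections instead.
    """
    dim = len(features[0])
    out = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output is a weighted average over ALL positions, near or far.
        out.append([sum(w * v[c] for w, v in zip(weights, features))
                    for c in range(dim)])
    return out

def augment(conv_features, attn_features):
    """Concatenate the two feature sets channel-wise at each position."""
    return [c + a for c, a in zip(conv_features, attn_features)]

pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 2x2 image, 2 channels
conv_out = [[0.5], [0.25], [0.75], [0.0]]  # stand-in for a conv layer's output
combined = augment(conv_out, self_attention(pixels))
```

Each row of `combined` then carries one local (convolutional) channel and two global (attention) channels, so later layers can draw on both kinds of information.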
Masanari Kimura 1, Masayuki Tanaka 1,2
1 National Institute of Advanced Industrial Science and Technology
2 Tokyo Institute of Technology
Abstract
Deep neural networks (DNNs) are known as black-box models. In other words, it is difficult to interpret the internal state of the model. Improving the interpretability of DNNs is one of the hot research topics. However, at present, the definition of interpretability for DNNs is vague, and the question of what constitutes a highly explanatory model is still controversial. To address this issue, we provide a definition of the human predictability of a model, as one part of the interpretability of DNNs. The human predictability proposed in this paper is defined by how easily a human can predict the change in the inference when the DNN model is perturbed. In addition, we introduce one example of a highly human-predictable DNN. We argue that our definition will help research on the interpretability of DNNs across various types of applications.
Introduction
In recent years, Deep Neural Networks (DNNs) have achieved great success in a number of tasks (Deng et al. 2009; Liu et al. 2017).
Central to models of human visual attention is the saliency map. We propose a hierarchical visual architecture that operates on a saliency map and uses a novel attention mechanism to sequentially focus on salient regions and take additional glimpses within those regions. The architecture is motivated by human visual attention and is applied to multi-label image classification on a novel multiset task, where it achieves high precision and recall while localizing objects with its attention. Unlike conventional multi-label image classification models, the model supports multiset prediction thanks to a reinforcement-learning-based training process that allows for arbitrary label permutation and multiple instances per label. Papers published at the Neural Information Processing Systems Conference.