An Algorithm for Routing Capsules in All Domains
Building on recent work on capsule networks, we propose a new, general-purpose form of "routing by agreement" that activates output capsules in a layer as a function of their net benefit to use and net cost to ignore input capsules from earlier layers. To illustrate the usefulness of our routing algorithm, we present two capsule networks that apply it in different domains: vision and language. The first network achieves new state-of-the-art accuracy of 99.1% on the smallNORB visual recognition task with fewer parameters and an order of magnitude less training than previous capsule models, and we find evidence that it learns to perform a form of "reverse graphics." The second network achieves new state-of-the-art accuracies on the root sentences of the Stanford Sentiment Treebank: 58.5% on fine-grained and 95.6% on binary labels with a single-task model that routes frozen embeddings from a pretrained transformer as capsules. In both domains, we train with the same regime. Code is available at https://github.com/glassroom/heinsen_routing along with replication instructions.
Learning to compute inner consensus -- A noble approach to modeling agreement between Capsules
The field now called Deep Learning has expanded these ideas by creating models that stack multiple layers of Perceptrons. These Multilayer Perceptrons, commonly known as Neural Networks [7], achieve greater representational capacity than their precursor because of the layered manner in which computational complexity is added. Owing to this compositional approach, they are especially hard-wired to learn a nested hierarchy of concepts [27]. As an approach to soft computing, Neural Networks stand in contrast to precisely stated analytical algorithms, which, unlike the human mind, do not tolerate imprecision, uncertainty, partial truth, or approximation [5]. Together with other Deep Learning models, they stand at the vanguard of Artificial Intelligence research and are employed in tasks that were previously found to be computationally intractable.
Information Aggregation for Multi-Head Attention with Routing-by-Agreement
Li, Jian, Yang, Baosong, Dou, Zi-Yi, Wang, Xing, Lyu, Michael R., Tu, Zhaopeng
Multi-head attention is appealing for its ability to jointly extract different types of information from multiple representation subspaces. Concerning the information aggregation, a common practice is to use a concatenation followed by a linear transformation, which may not fully exploit the expressiveness of multi-head attention. In this work, we propose to improve the information aggregation for multi-head attention with a more powerful routing-by-agreement algorithm. Specifically, the routing algorithm iteratively updates the proportion of how much a part (i.e. the distinct information learned from a specific subspace) should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Experimental results on linguistic probing tasks and machine translation tasks prove the superiority of the advanced information aggregation over the standard linear transformation.
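The iterative part-to-whole assignment described in this abstract can be sketched roughly as follows. This is a minimal illustration of the vector-based dynamic routing commonly used in the capsule literature, not the exact procedure of the paper. It assumes the "votes" (what each part proposes for each whole) have already been computed, and the names `route`, `squash`, `votes`, and `n_iters` are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def squash(v):
    # Keep the direction of v but shrink its norm into [0, 1).
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0 for _ in v]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def route(votes, n_iters=3):
    """votes[i][j] is the vector that part i proposes for whole j (assumed given)."""
    n_parts, n_wholes = len(votes), len(votes[0])
    dim = len(votes[0][0])
    # Routing logits start at zero, so every part is split evenly at first.
    logits = [[0.0] * n_wholes for _ in range(n_parts)]
    for _ in range(n_iters):
        # Proportion of each part assigned to each whole.
        coeffs = [softmax(row) for row in logits]
        # Each whole is the squashed, weighted sum of the votes it receives.
        wholes = []
        for j in range(n_wholes):
            s = [sum(coeffs[i][j] * votes[i][j][d] for i in range(n_parts))
                 for d in range(dim)]
            wholes.append(squash(s))
        # Agreement (dot product of vote and whole) raises the routing logit.
        for i in range(n_parts):
            for j in range(n_wholes):
                logits[i][j] += sum(a * b for a, b in zip(votes[i][j], wholes[j]))
    return coeffs, wholes


# Toy input: three parts, two wholes, 2-dimensional votes.
# The votes for whole 0 roughly agree; the votes for whole 1 are small and scattered.
votes = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.9, 0.1], [0.0, -0.1]],
    [[1.0, 0.1], [0.1, 0.0]],
]
coeffs, wholes = route(votes)
```

In this toy input, the parts agree on whole 0, so the routing coefficients shift toward it over the iterations. A static aggregation such as concatenation followed by a linear transformation would apply the same weights regardless of the input.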
Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement
Dou, Zi-Yi, Tu, Zhaopeng, Wang, Xing, Wang, Longyue, Shi, Shuming, Zhang, Tong
With the promising progress of deep neural networks, layer aggregation has been used to fuse information across layers in various fields, such as computer vision and machine translation. However, most of the previous methods combine layers in a static fashion in that their aggregation strategy is independent of specific hidden states. Inspired by recent progress on capsule networks, in this paper we propose to use routing-by-agreement strategies to aggregate layers dynamically. Specifically, the algorithm learns the probability of a part (individual layer representations) assigned to a whole (aggregated representations) in an iterative way and combines parts accordingly. We implement our algorithm on top of the state-of-the-art neural machine translation model TRANSFORMER and conduct experiments on the widely-used WMT14 English-German and WMT17 Chinese-English translation datasets. Experimental results across language pairs show that the proposed approach consistently outperforms the strong baseline model and a representative static aggregation model.