Transformers from an Optimization Perspective
Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond to the Transformer forward pass? By finding such a function, we can view Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has frequently been adopted in the past to elucidate simpler deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms, like the Transformer, has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
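To make the unfolding idea concrete, the sketch below takes a single gradient step on a simple log-sum-exp energy over token similarities; the query-side part of that gradient has exactly the form of a softmax self-attention update. The energy, the weight matrix W, the inverse temperature beta, and the step size eta are illustrative assumptions and not the energy function derived in the paper.

```python
# A minimal sketch of the unfolding idea (not the paper's energy): one
# gradient step on a log-sum-exp energy over token similarities yields a
# softmax self-attention-style update. E, W, beta, and eta are assumptions.
import numpy as np

def energy(X, W, beta=1.0):
    """E(X) = -(1/beta) * sum_i log sum_j exp(beta * x_i^T W x_j)."""
    S = beta * X @ W @ X.T                       # pairwise similarity scores
    return -np.log(np.exp(S).sum(axis=1)).sum() / beta

def attention_like_step(X, W, beta=1.0, eta=0.5):
    """Query-side gradient step on E: x_i <- x_i + eta * sum_j a_ij (W x_j),
    where a_ij = softmax_j(beta * x_i^T W x_j) are attention weights."""
    S = beta * X @ W @ X.T
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax: a_ij
    return X + eta * A @ X @ W.T                 # softmax-weighted aggregation

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                  # 4 tokens, dimension 8
W = 0.1 * rng.standard_normal((8, 8))
print("energy before:", energy(X, W), "after:", energy(attention_like_step(X, W), W))
```

This conveys only the structural correspondence between one descent step and one attention layer; the paper's actual energy, and the obstacles it must overcome, are more involved.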
7 Appendix
Figure 5: Comparison of the GenStat architecture to selected graph generative models.
Observation 1. Suppose that a GGM parameterized by θ is trained with the GenStat architecture and permutation-invariant descriptor functions Φ. Since the loss function is permutation-invariant, so are its gradient updates, which establishes claim 1.
This proof uses two properties of LDP: composability and immunity to post-processing [2].
Lemma 1. Collected data (local node-level statistics calculated and perturbed by R) from each node satisfies …
Figure 6 illustrates the PGM of the randomized algorithms.
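As a rough illustration of the two ingredients referenced above, the sketch below pairs a permutation-invariant descriptor (a degree histogram, unchanged under any relabeling of nodes) with a per-node randomizer R implemented as a Laplace mechanism; the specific descriptor, sensitivity, and privacy budget are assumptions rather than GenStat's actual choices.

```python
# Rough sketch (not GenStat's implementation) of a permutation-invariant
# descriptor and a per-node randomizer R. The Laplace mechanism, its
# sensitivity, and epsilon are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def degree_histogram(adj, max_degree):
    """Permutation-invariant descriptor: relabeling nodes permutes rows and
    columns of adj but leaves the degree histogram unchanged."""
    degrees = adj.sum(axis=1).astype(int)
    return np.bincount(degrees, minlength=max_degree + 1)

def randomize(value, epsilon, sensitivity=1.0):
    """Randomizer R: Laplace noise applied locally to one node's statistic."""
    return value + rng.laplace(scale=sensitivity / epsilon)

# Toy undirected graph on 6 nodes (no self-loops).
adj = rng.integers(0, 2, size=(6, 6))
adj = np.triu(adj, 1)
adj = adj + adj.T

print("descriptor:", degree_histogram(adj, max_degree=5))
print("node 0's perturbed degree:", randomize(adj[0].sum(), epsilon=1.0))
```

Because each node perturbs its own statistic before release, composability bounds the total privacy cost across statistics, and any training performed on the noisy aggregates is covered by immunity to post-processing.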
Neural Graph Generation from Graph Statistics
We describe a new setting for learning a deep graph generative model (GGM) from aggregate graph statistics, rather than from the graph adjacency matrix. Matching the statistics of observed training graphs is the main approach for learning traditional GGMs (e.g., BTER, Chung–Lu, and Erdős–Rényi models). Privacy researchers have proposed learning from graph statistics as a way to protect privacy. We develop an architecture for training a deep GGM to match statistics while preserving local differential privacy guarantees. Empirical evaluation on 8 datasets indicates that our deep GGM generates more realistic graphs than the traditional non-neural GGMs when both are learned from graph statistics only. We also compare our deep GGM trained on statistics only to state-of-the-art deep GGMs that are trained on the entire adjacency matrix. The results show that graph statistics are often sufficient to build a competitive deep GGM that generates realistic graphs while protecting local privacy.
Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
We analyze the dynamics of large-batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of O(1/√κ), matching optimal full-batch momentum (in particular, performing as well as a full batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.
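To make the update being analyzed concrete, here is a minimal mini-batch SGD+M loop on a synthetic least-squares problem; the learning rate, momentum parameter, batch size, and data are illustrative placeholders rather than the Hessian-spectrum-dependent choices the paper prescribes.

```python
# Minimal sketch of mini-batch SGD with momentum (SGD+M) on least squares.
# The learning rate, momentum, batch size, and synthetic data below are
# illustrative placeholders, not the paper's Hessian-dependent choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 2000, 200, 256
A = rng.standard_normal((n, d)) / np.sqrt(d)       # data matrix
x_star = rng.standard_normal(d)                    # ground-truth coefficients
b = A @ x_star + 0.01 * rng.standard_normal(n)     # noisy targets

def sgd_momentum(lr=2.0, beta=0.9, steps=400):
    x = np.zeros(d)
    v = np.zeros(d)                                # momentum buffer
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch   # mini-batch gradient
        v = beta * v - lr * grad                   # heavy-ball momentum update
        x = x + v
    return x

x_hat = sgd_momentum()
print("relative error:", np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star))
```

The batch-size regimes described in the abstract refer to how this loop behaves relative to the ICR: above it, the iterates track full-batch momentum; below it, the rate degrades toward a multiple of the single-batch SGD rate.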
Appendices for PLUGIn: A simple algorithm for inverting generative models with recovery guarantees
Here we state some results on Gaussian matrices, which will be used in the proofs later. Let the activation function, mapping ℝ to ℝ, be positively homogeneous. The following theorem is the concentration-of-measure inequality for Lipschitz functions of a Gaussian vector; here we only state a one-sided version, though it is more commonly stated in two-sided form. A proof of Theorem 2 can be found in [30].
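Since the statement itself is truncated in this excerpt, the following is the standard one-sided Gaussian concentration inequality for Lipschitz functions, which appears to be what Theorem 2 refers to; the paper's exact constants and formulation may differ.

```latex
% Standard one-sided Gaussian concentration for Lipschitz functions; this is
% the form the truncated statement above appears to refer to (the paper's
% Theorem 2 may differ in constants or formulation).
Let $g \sim \mathcal{N}(0, I_n)$ and let $f : \mathbb{R}^n \to \mathbb{R}$ be
$L$-Lipschitz with respect to the Euclidean norm. Then, for every $t > 0$,
\[
  \mathbb{P}\bigl( f(g) \ge \mathbb{E}\, f(g) + t \bigr)
  \;\le\; \exp\!\Bigl( -\tfrac{t^2}{2L^2} \Bigr).
\]
The two-sided version bounds
$\mathbb{P}\bigl( \lvert f(g) - \mathbb{E}\, f(g) \rvert \ge t \bigr)$ by
$2\exp\bigl(-t^2/(2L^2)\bigr)$.
```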
Predicting Label Distribution from Multi-label Ranking
A.1 Proof of Theorem 1
Theorem 1. If an instance is annotated by a multi-label ranking σ, m is the number of relevant labels, and δ and δ̂ are the implicit and explicit margins, respectively, then the EAE of σ is given by Eq. (1).
Proof. The expected approximation error arising from multi-label ranking comes mainly from the relevant labels, hence we only need to consider the relevant labels. It is obvious that Eq. (3) holds for k = 2, and by induction Eq. (3) holds for k = 2, 3, …. Likewise, Eq. (5) holds for k = 2, and by induction for k = 2, 3, …. Eq. (1) can then be obtained by combining Eq. (7) and ….
A.2 Proof of Lemma 1
Lemma 1. If an instance is annotated by a multi-label ranking σ, then the margins δ and δ̂ satisfy 0 ≤ δ ≤ ….
A.4 Proof of Corollary 2
Corollary 2. If an instance is annotated by a multi-label ranking σ and m is the number of relevant labels, then 0 ≤ δ ≤ ….
A.5 Proof of Theorem 2
Theorem 2. If an instance is annotated by a logical label vector l, m is the number of relevant labels, and δ and δ̂ are the implicit and explicit margins, respectively, then the EAE of l is ε ….
Proof. The expected approximation error arising from logical labels comes mainly from labels with a logical value of 1, hence we consider only the relevant labels.
A.6 Proof of Corollary 3
Corollary 3. If an instance is annotated by a multi-label ranking σ, m is the number of relevant labels, and δ and δ̂ are uniform over [0, ….
A.8 Details of DRAM
The probability density function of the Dirichlet distribution with parameter α = (α₁, …, α_K) is f(x; α) = (1/B(α)) ∏ᵢ xᵢ^(αᵢ − 1), where B(α) is the multivariate Beta function. The first four rows in Table 1 are the existing label distribution datasets; the last three rows in Table 1 are the datasets we created. Since some examples in the original label distribution datasets do not satisfy the prerequisites of our paper (i.e., there exist relevant labels with identical label description degrees), we remove these examples to obtain the dataset {(x, d) ∈ D | …}. Since the instances in Emotion6, Twitter-LDL and Flickr-LDL are images, we use a VGG16 [2] network pre-trained on ImageNet [1] to extract 1000-dimensional features.
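Since the paragraph above mentions extracting 1000-dimensional features with an ImageNet-pretrained VGG16, here is a hedged sketch of how that step could look with torchvision; the exact preprocessing, weights, and image paths used by the authors are assumptions.

```python
# Hedged sketch of the 1000-dimensional feature extraction step described
# above, using an ImageNet-pretrained VGG16 from torchvision. The exact
# preprocessing, weights, and layer used in the paper may differ.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def extract_features(image_path):
    """Return the 1000-dimensional output of the final VGG16 layer."""
    img = Image.open(image_path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)                # shape (1, 3, 224, 224)
    with torch.no_grad():
        return vgg16(batch).squeeze(0)                  # shape (1000,)

# Example (hypothetical path):
# feats = extract_features("emotion6/sample_0001.jpg")
# print(feats.shape)  # torch.Size([1000])
```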
Predicting Label Distribution from Multi-label Ranking
Label distribution can provide richer information about label polysemy than logical labels in multi-label learning. There are currently two strategies, LDL (label distribution learning) and LE (label enhancement), for predicting label distributions. LDL requires experts to annotate instances with label distributions and learns a predictive mapping on such a training set. LE requires experts to annotate instances with logical labels and generates label distributions from them. However, LDL requires costly annotation, and the performance of LE is unstable. In this paper, we study the problem of predicting label distribution from multi-label ranking, which is a compromise w.r.t. …
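To make the terminology concrete, here is a hypothetical single instance annotated three ways: the logical labels and the ranking are the cheaper annotations contrasted in the abstract, while the label distribution is the richer target to be predicted (all labels and numbers below are illustrative).

```python
# Hedged illustration (hypothetical instance and numbers) of the three
# annotation forms contrasted in the abstract: a logical label vector,
# a multi-label ranking over the relevant labels, and a label distribution
# whose description degrees sum to one.
labels = ["joy", "surprise", "sadness", "anger"]

logical_labels = [1, 1, 0, 1]                       # relevant vs. irrelevant only
multilabel_ranking = ["joy", "anger", "surprise"]   # relevant labels, ordered by importance
label_distribution = [0.55, 0.15, 0.05, 0.25]       # description degree per label, sums to 1

assert abs(sum(label_distribution) - 1.0) < 1e-9
print(dict(zip(labels, label_distribution)))
```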
A Proof of Theorems
The proof of the key lemma (Lemma 5), which establishes a connection between the margin operator and the robust margin operator, is presented in the main content. We still need to demonstrate that the properties used in the PAC-Bayes analysis hold for both the margin operator and the robust margin operator. The following proofs are adapted from the work of Neyshabur et al. (2017b), with the steps kept independent of the (robust) margin operator. We will begin by finishing the proofs of Lemma 6 and Lemma 7; afterward, we will proceed to complete the proof of Theorem 1, which is our primary result. The supporting result is provided in (Neyshabur et al., 2017b), with which we then complete the proof of Lemma 6.1. By combining Lemma 6.1 and Lemma 5, we directly obtain Lemma 6.2.
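The two operators are used above without their definitions appearing in this excerpt; for readability, a plausible form is shown below, with the margin operator following Neyshabur et al. (2017b) and the robust version taken as a worst case over bounded input perturbations (the paper's exact robust definition, including the perturbation set, is an assumption here).

```latex
% Plausible definitions of the margin operator and a robust margin operator
% (the paper's exact robust definition, e.g. its perturbation set, may differ).
% f(x)[y] denotes the score the network assigns to class y, and gamma bounds
% the perturbation radius.
\[
  M\bigl(f(x), y\bigr) \;=\; f(x)[y] \;-\; \max_{y' \neq y} f(x)[y'],
  \qquad
  M_{\mathrm{rob}}\bigl(f, x, y\bigr) \;=\;
  \min_{\lVert \epsilon \rVert \le \gamma} M\bigl(f(x + \epsilon), y\bigr).
\]
```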