Materials
A Task Setups
Table 4: Shared hyperparameters for all models, given for each task.

Hyperparameter   Random Walk   Algorithm   Reddit/BASE   Enwik8
Layers           4             4           8             8
Hidden size      256           256         512           512
Head count       4             4           8             8
Dropout rate     0.2           0.2         0.3           0.3
Embed.

We provide the hyperparameter setups shared across our models for each task in Table 4.

Random Walk. We train 4-layer models with a hidden size of 256 and 4 attention heads.

Algorithm. We train 4-layer models with a hidden size of 256 and 4 attention heads.

Staircase model which was run 5 times.
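The shared settings in Table 4 can be written down as a small configuration object. The following is a minimal sketch, not the authors' code: the names `SharedHParams`, `TABLE_4`, and `head_dim` are hypothetical, only the four complete rows of the table are encoded, and the per-head dimension assumes the common convention of splitting the hidden size evenly across attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedHParams:
    """Hypothetical container for the four complete rows of Table 4."""
    layers: int
    hidden_size: int
    heads: int
    dropout: float


# Values copied from Table 4; the dictionary keys are illustrative names.
TABLE_4 = {
    "random_walk": SharedHParams(layers=4, hidden_size=256, heads=4, dropout=0.2),
    "algorithm": SharedHParams(layers=4, hidden_size=256, heads=4, dropout=0.2),
    "reddit_base": SharedHParams(layers=8, hidden_size=512, heads=8, dropout=0.3),
    "enwik8": SharedHParams(layers=8, hidden_size=512, heads=8, dropout=0.3),
}


def head_dim(hp: SharedHParams) -> int:
    """Per-head width, assuming the hidden size is split evenly across heads."""
    if hp.hidden_size % hp.heads != 0:
        raise ValueError("hidden_size must be divisible by heads")
    return hp.hidden_size // hp.heads


if __name__ == "__main__":
    for task, hp in TABLE_4.items():
        print(task, hp, "head_dim =", head_dim(hp))
```

Under that even-split assumption, both the 4-layer and the 8-layer settings give the same per-head width, since 256 / 4 = 512 / 8.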
A Proof of Theorem 2
We prove the universal approximation theorem by showing the equivalence of TFN and our model.

Complex spherical harmonics are related to Clebsch-Gordan coefficients via [51, 3.7.72]

We can therefore adapt Eq. (2) by substituting C

To see this, we look at the result's real component

To prove this theorem we first introduce a proposition by Villar et al. [57].

GemNet's variance varies strongly between layers and increases significantly after each block without scaling factors (top).

We use 4 stacked interaction blocks and an embedding size of 128 throughout the model.
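The remark about variance growing after each block without scaling factors can be illustrated with a toy calculation. The sketch below is not GemNet: each "block" is replaced by an independent unit-variance Gaussian update added to the running activation, and the function name `variance_per_block` and the constant factor 1/sqrt(2) are assumptions chosen so that the toy variance stays near 1. It only shows why an unscaled residual sum accumulates variance across stacked blocks.

```python
import math
import random
import statistics


def variance_per_block(num_blocks=4, n=20000, scale=1.0, seed=0):
    """Return the activation variance after each toy residual block.

    Each block adds independent N(0, 1) noise to the activation (a stand-in
    for a block's output) and multiplies the sum by `scale`.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    variances = []
    for _ in range(num_blocks):
        x = [scale * (xi + rng.gauss(0.0, 1.0)) for xi in x]
        variances.append(statistics.pvariance(x))
    return variances


if __name__ == "__main__":
    # Without scaling, the variance grows by about 1 per block.
    print("unscaled:", variance_per_block(scale=1.0))
    # With the assumed factor 1/sqrt(2), the variance stays close to 1.
    print("scaled:  ", variance_per_block(scale=1.0 / math.sqrt(2.0)))
```

In this toy setting the sum of two independent unit-variance terms has variance 2, so multiplying by 1/sqrt(2) restores unit variance after every block.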
A Omitted proofs
A.3 Formulation of bound constrained dual problem

Proposition 1. For any non-negative p, q, we generate a feasible pair from p and q as follows.

In Section 5.3, we describe that it can be helpful to regularize

We also mention here a minor difference in derivations for the convenience of readers. As expected, this term also appears in these other formulations [25, 42].

All experiments were run on a single P100 GPU. This adjustment was not necessary for CNN experiments.