 relu activation function



Polyhedron Attention Module: Learning Adaptive-order Interactions (Appendixes)

Neural Information Processing Systems

Contents: A. Deriving Eq. 2; B. The hyperplane set generated by the oblique tree is a superset of that created by the ReLU-activated plain DNN; C. Proof of Theorem 1; D. Proof of Theorem 2; E. Proof of Theorem 3; F. Proof of Theorem 4; G. Implementation Detail. … We consider an $L$-layer ($L \geq 2$) ReLU-activated plain DNN module $f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ with input $x \in \mathbb{R}^p$. … Eq. 2 in the main text can be obtained by rewriting P … An oblique tree is a binary tree where each node splits the space by a hyperplane rather than by thresholding a single feature. The tree starts at the root with the full input space $S$ and grows deeper by recursively splitting $S$. For a $D$-depth ($D \geq 3$) binary tree, there are $2^{D-1} - 1$ internal nodes and $2^{D-1}$ leaf nodes. As shown in Figure 1, each internal and leaf node maintains a sub-space representing a polyhedron in $S$, and each layer of the tree corresponds to a partition of the input space into polyhedrons.
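The oblique-tree description above can be illustrated with a minimal sketch. It assumes a full binary tree stored in heap order (the children of node `i` are `2*i + 1` and `2*i + 2`); the names `node_counts`, `route`, and `hyperplanes`, and the example hyperplanes themselves, are hypothetical and not taken from the paper.

```python
# Sketch of an oblique binary tree in heap order (illustrative layout only).
# Each internal node holds a hyperplane (w, b) and sends x right if w.x + b >= 0.

def node_counts(depth):
    """Return (internal, leaves) for a full binary tree with `depth` levels."""
    leaves = 2 ** (depth - 1)
    return leaves - 1, leaves

def route(x, hyperplanes):
    """Route point x from the root to a leaf; return the leaf's 0-based index."""
    n_internal = len(hyperplanes)
    i = 0
    while i < n_internal:
        w, b = hyperplanes[i]
        go_right = sum(wj * xj for wj, xj in zip(w, x)) + b >= 0
        i = 2 * i + 2 if go_right else 2 * i + 1
    return i - n_internal

# Depth-3 tree on R^2: 3 internal nodes, 4 leaves (4 polyhedral regions).
hyperplanes = [
    ([1.0, 1.0], 0.0),   # root:        x1 + x2 >= 0
    ([1.0, -1.0], 0.0),  # left child:  x1 - x2 >= 0
    ([1.0, -1.0], 0.0),  # right child: x1 - x2 >= 0
]

print(node_counts(3))
print(route([2.0, 1.0], hyperplanes))
```

Each leaf index corresponds to one polyhedron: the intersection of the half-spaces chosen along the root-to-leaf path.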



Supplementary Material: Repulsive Deep Ensembles are Bayesian (A. Non-identifiable neural networks)

Neural Information Processing Systems

Deep neural networks are parametric models that can learn complex non-linear functions from few training instances and can therefore be deployed on many tasks. Their overparameterized architecture, with far more parameters than training data points, lets them memorize entire datasets even with random labels [84]. Moreover, in this overparameterized regime the neural network approximation of a given function is not unique: multiple weight configurations can yield the same function. Indeed, the output of a feedforward neural network for a fixed input is unchanged under a set of transformations. For instance, certain weight permutations and sign flips in MLPs leave the output unchanged [9].
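A small numerical check can make these symmetries concrete. The sketch below uses a hypothetical one-hidden-layer MLP with arbitrary made-up weights; `mlp`, `relu`, and all weight values are illustrative assumptions. The permutation symmetry holds for any elementwise activation, while the sign-flip symmetry shown here relies on an odd activation such as tanh.

```python
import math

def mlp(x, W1, b1, w2, b2, act):
    """One-hidden-layer MLP with a scalar output."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2

def relu(z):
    return max(0.0, z)

# Arbitrary illustrative parameters: 2 inputs, 3 hidden units.
W1 = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.9]]
b1 = [0.1, -0.2, 0.0]
w2 = [1.5, -0.4, 0.8]
b2 = 0.25
x = [0.3, -1.2]

# Permutation: reorder hidden units consistently in both layers.
perm = [2, 0, 1]
W1_p = [W1[i] for i in perm]
b1_p = [b1[i] for i in perm]
w2_p = [w2[i] for i in perm]

# Sign flip on hidden unit 0: negate its incoming weights, bias, and outgoing weight.
W1_s = [[-w for w in W1[0]]] + W1[1:]
b1_s = [-b1[0]] + b1[1:]
w2_s = [-w2[0]] + w2[1:]

print(mlp(x, W1, b1, w2, b2, relu), mlp(x, W1_p, b1_p, w2_p, b2, relu))
print(mlp(x, W1, b1, w2, b2, math.tanh), mlp(x, W1_s, b1_s, w2_s, b2, math.tanh))
```

Each printed pair should agree up to floating-point rounding. Applying the same sign flip with ReLU instead of tanh generally changes the output, because ReLU is not an odd function.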






Polyhedron Attention Module: Learning Adaptive-order Interactions (Appendixes)

Neural Information Processing Systems

… 's leaf nodes to form … Given the definition of our attention in Eq. 9 in the main text, the highest polynomial order is … Before providing the proof of Theorem 4, we establish Lemma 1 as its foundation. … We follow the principle of Yan et al.'s work [ … Figure 1, we consider two kinds of value functions, i.e., … In PAM-Net, we set the number of levels to 2. A grid search is performed over different configurations … We conduct grid searches on the dropout rate over {0, 0.1, 0.2} and the initial …


A Proof of Theorem 1

Neural Information Processing Systems

Theorem 6 is stated in terms of Gaussian complexity; Ben-David (2014) gives a full proof. M(α) is the linear class following the depth-K neural network. The second term relies on the Lipschitz constant of the DNN, which we bound with the following lemma. Similar results are given by Scaman and Virmaux (2018) and Fazlyab et al. (2019).
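The lemma itself is not reproduced in this excerpt. As a rough illustration of what bounding a DNN's Lipschitz constant can look like, the sketch below uses a standard and generally loose bound: because ReLU is 1-Lipschitz, the product of the layers' operator norms upper-bounds the network's Lipschitz constant. This is an assumption made for illustration, not necessarily the paper's lemma. The induced l-infinity norm (maximum absolute row sum) is used to keep the code dependency-free, and the network, weights, and function names are hypothetical.

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu_net(x, weights):
    """Bias-free ReLU network; ReLU follows every layer except the last."""
    for k, W in enumerate(weights):
        x = matvec(W, x)
        if k < len(weights) - 1:
            x = [max(0.0, v) for v in x]
    return x

def inf_norm(W):
    """Induced l-infinity operator norm: maximum absolute row sum."""
    return max(sum(abs(w) for w in row) for row in W)

def lipschitz_upper_bound(weights):
    """Product of per-layer operator norms (valid because ReLU is 1-Lipschitz)."""
    bound = 1.0
    for W in weights:
        bound *= inf_norm(W)
    return bound

def linf(v):
    return max(abs(a) for a in v)

rng = random.Random(0)
weights = [
    [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)],  # layer 1: R^3 -> R^4
    [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)],  # layer 2: R^4 -> R^2
]
bound = lipschitz_upper_bound(weights)

# Empirical check on one random pair: ||f(x) - f(y)|| <= bound * ||x - y||.
x = [rng.uniform(-1, 1) for _ in range(3)]
y = [rng.uniform(-1, 1) for _ in range(3)]
diff = [a - b for a, b in zip(relu_net(x, weights), relu_net(y, weights))]
print(linf(diff) <= bound * linf([a - b for a, b in zip(x, y)]))
```

The same product bound applies with spectral norms for the l2 case; tighter estimates require more work than this sketch attempts.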