

dbc4d84bfcfe2284ba11beffb853a8c4-AuthorFeedback.pdf

Neural Information Processing Systems

Note that the theoretical equivalence requires near-zero initialization, gradient flow (small learning rate), and a large number of channels. These require significant computational resources. One of the advantages of kernel methods is that they require little computation on a small dataset, which is a very appealing feature for architecture search.


959ef477884b6ac2241b19ee4fb776ae-AuthorFeedback.pdf

Neural Information Processing Systems

The proposed group bilinear requires the intra-group channels to be highly correlated (refer to the definition in Q3.1), and the proposed semantic grouping can better satisfy such requirements than MA-CNN [9]. Specifically, [9] adopts the idea of k-means, which optimizes each channel toward its cluster center. Note that the notations above are the same as in Eqn. (3), and the pairwise correlation is d_ij =

Thanks for your comments. A is an approximate index-mapping matrix, whose rows are constrained to be (approximate) one-hot vectors via a softmax with a small "temperature".

Q2.2 Inconsistent notations. Thanks for your comments; we will correct the notation "stage 3,4" to "Stage IV, V" respectively. Designing suitable grouping methods plays a key role.
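As an illustration of the constraint mentioned above, here is a minimal sketch (with hypothetical shapes and temperature value, not the authors' implementation) of how a row-wise softmax with a small temperature drives the rows of A toward (approximate) one-hot vectors:

```python
import numpy as np

def soft_one_hot(logits, temperature=0.05):
    """Row-wise softmax; a small temperature pushes each row toward a one-hot vector."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy logits: each row will concentrate its mass on the largest entry.
logits = np.array([[1.0, 0.2, -0.5],
                   [0.1, 0.9,  0.3]])
A = soft_one_hot(logits)
```

Each row of `A` sums to 1 while the largest entry is pushed close to 1, so `A` behaves like an index-mapping matrix yet remains differentiable.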


8804f94e16ba5b680e239a554a08f7d2-AuthorFeedback.pdf

Neural Information Processing Systems

We train the autoencoder and the classifier on the training set, which is diverse and contains texts with varying degrees of the attributes, reflected by the different confidence values given by the classifier. Different from most previous work, which only provides binary control over attributes, one advantage of our model is the ability to control the degree of attribute transfer desired. In particular, 'Acc' is used to evaluate attribute accuracy.


7716d0fc31636914783865d34f6cdfd5-AuthorFeedback.pdf

Neural Information Processing Systems

This is because a_t^⊤ a takes a large number of iterations to increase from negative to 0. Consequently, with a large step size, w can move far away from w before a_t^⊤ a becomes nonnegative. For problems with multiple global optima, our analysis can still be applied if the following condition holds: there exists one global optimum such that the PD condition holds globally with respect to this optimum.



01386bd6d8e091c2ab4c7c7de644d37b-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their valuable feedback. As the reviewers point out, the deep equilibrium model offers a new perspective on deep networks. However, since the submission we have further observed that the conclusion from Figure 4 also holds in training.


ec24a54d62ce57ba93a531b460fa8d18-AuthorFeedback.pdf

Neural Information Processing Systems

We experimented with using Sigmoid functions, and it does not work. To Reviewer #3: Our algorithm scales linearly w.r.t. the input size, which is not larger than that of any other algorithm. This can be realized by a max operation, which is differentiable. Note that we do not need to use an argmax operation to find the index r. Instead, every decoded token is involved in the later computation.
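A toy sketch of the point above: selection via a max stays differentiable (its subgradient with respect to the scores is one-hot at the maximizer), whereas an integer argmax index blocks gradients entirely. The function name and values are illustrative, not from the paper:

```python
import numpy as np

def max_select(scores, values):
    """Select a value via a max over scores instead of an integer argmax index.

    The subgradient of max(scores) w.r.t. scores is a one-hot vector at the
    maximizer, so gradients can flow; an argmax index lookup cannot propagate
    gradients at all.
    """
    m = scores.max()
    indicator = (scores == m).astype(float)  # subgradient of max w.r.t. scores
    return (indicator * values).sum(), indicator

scores = np.array([0.1, 2.0, 0.7])
values = np.array([10.0, 20.0, 30.0])
out, grad = max_select(scores, values)
# out picks the value at the maximizing score; grad is the one-hot subgradient
```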


d63fbf8c3173730f82b150c5ef38b8ff-AuthorFeedback.pdf

Neural Information Processing Systems

In our latest version, we have allowed the Markov chain to start from an arbitrary initial distribution φ rather than the stationary distribution π. To verify that this is a meaningful range for tuning L, we enumerate the trajectory length L from {10^4, …, 10^10}, estimate the co-occurrence matrix with a single trajectory sampled from BlogCatalog, convert the co-occurrence matrix to the one required by NetMF, and factorize it with SVD.
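A simplified sketch of that pipeline on a toy trajectory (the window size, node count, and the `log1p` transform here are illustrative; NetMF's actual matrix construction is more involved):

```python
import numpy as np

def cooccurrence_from_walk(walk, n_nodes, window=2):
    """Estimate a node co-occurrence matrix from a single trajectory."""
    C = np.zeros((n_nodes, n_nodes))
    for i, u in enumerate(walk):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if i != j:
                C[u, walk[j]] += 1.0
    return C

rng = np.random.default_rng(0)
walk = rng.integers(0, 5, size=10_000)  # toy trajectory; real walks come from the graph
C = cooccurrence_from_walk(walk, n_nodes=5)

# Convert to a matrix for factorization (stand-in for NetMF's transform), then SVD.
U, S, Vt = np.linalg.svd(np.log1p(C))
embeddings = U[:, :3] * np.sqrt(S[:3])  # rank-3 node embeddings
```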


9813b270ed0288e7c0388f0fd4ec68f5-AuthorFeedback.pdf

Neural Information Processing Systems

We will include supplemental material describing the preprocessing of the fMRI data and the network implementation details, and we will explain in detail anything that is not clear in the paper.


76cf99d3614e23eabab16fb27e944bf9-AuthorFeedback.pdf

Neural Information Processing Systems

As shown in "Input" of Algorithm 1, the A_j (1 ≤ j ≤ n) are sampled as fixed matrices. They construct the relations between the original space and the compressed space, and are not learnable in Eq. The implementation details are described in the supplementary material. The code is based on PyTorch.
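A minimal sketch of sampling fixed, non-learnable projection matrices mapping the original space to the compressed space. The dimensions are hypothetical, and plain NumPy is used here for brevity even though the authors' code is in PyTorch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_orig, d_comp, n = 16, 4, 3  # hypothetical original/compressed dims and count

# Sample the A_j once and keep them frozen (no gradient updates), so they only
# define the relation between the original and compressed spaces.
A = [rng.standard_normal((d_comp, d_orig)) / np.sqrt(d_comp) for _ in range(n)]

x = rng.standard_normal(d_orig)
compressed = [Aj @ x for Aj in A]  # n compressed views of x
```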