dbc4d84bfcfe2284ba11beffb853a8c4-AuthorFeedback.pdf
Note that the theoretical equivalence requires near-zero initialization, gradient flow (a small learning rate), and a large number of channels. These require significant computational resources. One of the advantages of kernel methods is that they require little computation on a small dataset, which is a very appealing feature for architecture search.
959ef477884b6ac2241b19ee4fb776ae-AuthorFeedback.pdf
The proposed group bilinear requires the intra-group channels to be highly correlated (refer to the definition in Q3.1), and the proposed semantic grouping can better satisfy such requirements than MA-CNN [9]. Specifically, [9] adopts the idea of k-means, which optimizes each channel toward its cluster center. Note that the notations above are the same as in Eqn. (3), and the pairwise correlation is denoted d_ij.

Thanks for your comments. A is an approximate index-mapping matrix, whose rows are constrained to be (approximately) one-hot vectors via a softmax with a small "temperature".

Q2.2 Inconsistent notations. Thanks for your comments; we will correct the notation "stage 3,4" to "Stage IV, V", respectively. Designing suitable grouping methods plays a key role.
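The low-temperature softmax constraint on the rows of A can be illustrated with a minimal numpy sketch (the logits here are hypothetical; in the paper A is learned, not hand-set):

```python
import numpy as np

def softmax(logits, temperature):
    # Dividing by a small temperature sharpens the distribution toward one-hot.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for a 3-row index-mapping matrix A.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.8, 0.3],
                   [0.1, 0.4, 2.5]])

A_soft = softmax(logits, temperature=1.0)   # soft row assignments
A_hard = softmax(logits, temperature=0.05)  # rows become approximately one-hot
print(np.round(A_hard, 3))
```

Because softmax remains differentiable at any positive temperature, the near-one-hot rows can still be trained end-to-end, unlike a hard argmax assignment.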
8804f94e16ba5b680e239a554a08f7d2-AuthorFeedback.pdf
We train the autoencoder and the classifier on the training set, which is diverse and contains texts with varying degrees of attributes, reflected by the different confidence values given by the classifier. Unlike most previous work, which only provides binary control over attributes, one advantage of our model is the ability to control the degree of attribute transfer desired. In particular, 'Acc' is used to evaluate the attribute accuracy.
7716d0fc31636914783865d34f6cdfd5-AuthorFeedback.pdf
This is because a_t^⊤ a takes a large number of iterations to increase from negative to 0. Consequently, with a large step size, w can move far away from w^* before a_t^⊤ a becomes nonnegative. For problems with multiple global optima, our analysis can still be applied if the following condition holds: there exists one global optimum such that the PD condition holds globally with respect to this optimum.
ec24a54d62ce57ba93a531b460fa8d18-AuthorFeedback.pdf
We experimented with using Sigmoid functions, and it does not work. To Reviewer #3: Our algorithm scales linearly w.r.t. the input size, which is no larger than that of any other algorithm. This can be realized by a max operation, which is differentiable. Note that we do not need to use an argmax operation to find the index r. Instead, every decoded token is involved in the later computation.
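The distinction between max and argmax drawn above can be illustrated with a minimal numpy sketch (hypothetical scores; the paper's decoder is not reproduced here). The value returned by max admits a (sub)gradient with respect to the scores, while argmax returns an integer index through which no gradient can flow:

```python
import numpy as np

# Hypothetical scores over four candidate tokens.
scores = np.array([0.1, 2.3, 0.7, 1.5])

value = scores.max()     # differentiable: its (sub)gradient w.r.t. scores is one-hot
index = scores.argmax()  # an integer index; gradients cannot flow through it

# Subgradient of max w.r.t. the scores: 1 at the winning entry, 0 elsewhere.
grad_max = (scores == value).astype(float)
print(value, index, grad_max)
```

This is why a max-based formulation can sit inside an end-to-end trained pipeline, whereas a hard argmax lookup would block backpropagation at that point.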
d63fbf8c3173730f82b150c5ef38b8ff-AuthorFeedback.pdf
In our latest version, we have allowed the Markov chain to start from an arbitrary initial distribution φ rather than the stationary distribution π. To verify that this is a meaningful range for tuning L, we enumerate the trajectory length L from {10^4, ..., 10^10}, estimate the co-occurrence matrix with the single trajectory sampled from BlogCatalog, convert the co-occurrence matrix to the one required by NetMF, and factorize it with SVD.
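The estimate-convert-factorize pipeline described above can be sketched as follows. This is a toy illustration only: a random walk on a small ring graph stands in for the BlogCatalog trajectory, and a simple shifted-log normalization stands in for the exact NetMF conversion:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, L, window = 5, 10_000, 2

# Toy single trajectory: a random walk on a ring graph with n_nodes nodes.
walk = np.zeros(L, dtype=int)
for t in range(1, L):
    walk[t] = (walk[t - 1] + rng.choice([-1, 1])) % n_nodes

# Estimate the co-occurrence matrix from the single trajectory,
# counting node pairs within the context window (symmetrically).
C = np.zeros((n_nodes, n_nodes))
for t in range(L - window):
    for r in range(1, window + 1):
        C[walk[t], walk[t + r]] += 1
        C[walk[t + r], walk[t]] += 1

# Convert the counts (here a stand-in shifted-log normalization,
# not the paper's exact NetMF formula) and factorize with SVD.
M = np.log(np.maximum(C / C.sum() * C.size, 1.0))
U, S, Vt = np.linalg.svd(M)
embedding = U[:, :2] * np.sqrt(S[:2])  # rank-2 node embedding
print(embedding.shape)
```

Sweeping L in this sketch (as the rebuttal does over {10^4, ..., 10^10}) shows how longer trajectories make the estimated co-occurrence matrix, and hence the factorized embedding, more stable.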