[R1 & R3: Sufficient discussion of the differences from, and direct empirical comparisons with, Cycle-wgan [7]]
We appreciate the feedback from R1, R2, and R3. We address the questions below and will revise the paper accordingly, adding the requested discussion and further empirical comparisons in the final version, including a comparison between the reported results of Cycle-wgan [7] and our model. We will also correct the noted typos in the final version.
We thank Reviewer #1 for pointing out this interesting work.
We thank all reviewers for taking the time to provide detailed feedback and valuable suggestions on our work. However, the two PDFs' exact expressions are in fact different. Alternatively, one can also use the No-U-Turn Sampler (NUTS) implemented in Stan. Equation (5) of the main text gives BNE's mean function; in the experiment, BNE is by construction more expressive than BAE. Figures 4-5 suggest that the former is true, but not the latter.
Response to Reviewer
C2: Rademacher complexity in this paper refers to its empirical version; we will clarify this in a future version. We use policy evaluation errors to evaluate the quality of model learning, as reported in Section 6.2. The shading in the plots denotes the standard deviation over 3 random seeds; we will clarify this as well. Regarding your choice of "No" in the reproducibility evaluation, we would like to point out that the proof and source code are provided. We thank you for your insightful suggestion.
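For completeness, the empirical Rademacher complexity referred to in C2 is the standard textbook quantity (a sketch; the function class $\mathcal{F}$ and sample $S = (x_1, \dots, x_n)$ are as in the paper):

```latex
\hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[\sup_{f \in \mathcal{F}}
      \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right],
\qquad \sigma_i \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}\{-1, +1\},
```

and the (population) Rademacher complexity is obtained by additionally taking the expectation over the sample $S$.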
R2/R3: 'To better judge the performance of the proposed method, it would be useful to include comparisons against existing approaches to the problem such as [Meeds et al., NIPS 2007] and [Bittdorf et al., NIPS 2012].' We have since conducted an experimental comparison to the LP approach of Bittdorf et al. in the separable setting with binary T. We found that our approach is more robust to noise, and we plan to add these results in the final version. The model of Meeds et al. involves two binary factors in a three-factor factorization and is hence different from our factorization model. A comparison can still be performed by running our approach in a two-step manner; provided that code can be obtained from Meeds et al., such a comparison will be included in the final version. Please note that the paper already contains a comparison to two methods based on alternating optimization (a standard approach to NMF), adapted to our specific factorization problem.
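For readers unfamiliar with the alternating-optimization baseline mentioned above, a minimal sketch of standard multiplicative-update NMF (the generic Lee-Seung scheme, not the adapted variants compared in the paper; the matrix sizes and iteration count below are illustrative):

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor V ~= W @ H with nonnegative W, H via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Alternate: update H with W fixed, then W with H fixed.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Example: recover an exactly rank-2 nonnegative matrix.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 15))
W, H = nmf_multiplicative(V, rank=2)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```

The updates preserve nonnegativity because each factor is multiplied by a ratio of nonnegative terms, which is why this scheme is a common baseline for NMF-style problems.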
This paper shows that the skip-gram model of Mikolov et al., when trained with their negative sampling approach, can be understood as a weighted factorization of a word-context matrix whose cells are weighted by pointwise mutual information (PMI), which has long been empirically known to be a useful way of constructing word-context matrices for learning semantic representations of words. This is an important result, since it provides a link between two (apparently) very different methods for constructing word embeddings that perform well empirically but seemed on the surface to have nothing to do with each other. Using this insight, the authors then propose a new matrix construction and find that it performs very well on standard tasks. The paper is mostly admirably clear (see below for a few suggestions on where citations could be added to make the relevant related work clear) and a very nice contribution toward explaining what is going on in these neural language model embedding models.
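The PMI-factorization view described in this review can be illustrated on toy data (a hedged sketch: the tiny corpus, window size, and use of positive PMI plus truncated SVD below are illustrative simplifications, not the paper's exact shifted-PMI construction):

```python
import numpy as np

# Toy corpus: count word-context co-occurrences in a +/-1 window.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            counts[idx[w], idx[corpus[j]]] += 1

# Positive PMI matrix: max(0, log P(w,c) / (P(w) P(c))).
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.maximum(pmi, 0.0)
ppmi[counts == 0] = 0.0

# Low-rank factorization via truncated SVD yields word embeddings.
U, S, Vt = np.linalg.svd(ppmi)
k = 2
word_vecs = U[:, :k] * np.sqrt(S[:k])  # rank-k embeddings
print(word_vecs.shape)  # prints (7, 2)
```

The point of the sketch is only the pipeline: co-occurrence counts, a PMI-weighted matrix, then an explicit low-rank factorization of that matrix, which is the object the paper relates to skip-gram with negative sampling.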
I recommend that the paper be accepted.
Regarding our proof techniques: the proof of Thm. 1, for the NTK of two-layer networks with bias, borrows techniques from [6]. Our proof technique for deep networks uses the algebra of RKHSs and is therefore novel in this context. Thm. 2 derives bounds that result from the relation between the Fourier expansion of the Laplace kernel and the NTK (established in Thm. 4), together with identifying the spaces fixed under the appropriate integral transform. "Why they need additional parameters a, b, c": we note that, analogously, the NTK becomes sharper for deeper networks.
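For context, the Laplace-kernel relation referenced above involves the standard expansion on the sphere (a sketch; the constant $c$ and the normalization are illustrative, not the paper's):

```latex
k_{\mathrm{Lap}}(x, x') = e^{-c \lVert x - x' \rVert},
\qquad x, x' \in \mathbb{S}^{d-1},
\qquad
k_{\mathrm{Lap}}(x, x') = \sum_{\ell \ge 0} \mu_\ell
   \sum_{m} Y_{\ell m}(x)\, Y_{\ell m}(x'),
```

where the $Y_{\ell m}$ are spherical harmonics and the coefficients $\mu_\ell$ decay polynomially in $\ell$. The NTK of a ReLU network admits an expansion of the same form, and comparing the two coefficient sequences is the kind of relation through which Fourier-expansion arguments yield bounds of the type in Thm. 2.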