1. We agree (and will acknowledge more explicitly) that the overall proof program is similar to

Neural Information Processing Systems 

We thank all the reviewers for their constructive feedback. We provide additional experimental results in Figure 1. Hence, gradient descent can get stuck in local optima more easily. R3's observations for collaborative filtering (CF) are valid, but they apply