Review for NeurIPS paper: Learning compositional functions via multiplicative weight updates
Neural Information Processing Systems
Weaknesses: I was not totally convinced by the experiments section, and have questions about that section and some more general questions which the authors might address:

1. The way that Figure 1 is laid out suggests that it is appropriate to compare the three algorithms over the same set of values of eta. Can the authors justify this? It seems to me that the meaning of eta in the Madam algorithm is different from its meaning in SGD and Adam (it is effectively a coincidence that these different hyper-parameters share a name). What happens if you evaluate Madam over a denser grid of eta values and then zoom in on the x-axis of the left-hand plot?

2. The value reported for the transformer, on the wikitext-2 task, for SGD and Madam, seems very high. Perhaps the authors are using a different unit of measurement?
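To make the concern in point 1 concrete, here is a minimal sketch of why eta is not directly comparable across the optimizers. The multiplicative rule below is a simplified illustration of the general idea, not the authors' exact Madam update: in an additive step, eta scales an absolute weight change (whose effective size depends on gradient scale), whereas in a multiplicative step of this form, eta bounds the *fractional* change per update, independent of gradient magnitude.

```python
import numpy as np

def sgd_step(w, g, eta):
    """Additive update: eta scales an absolute change, eta * g."""
    return w - eta * g

def multiplicative_step(w, g, eta):
    """Simplified multiplicative update (illustrative, not the paper's exact
    Madam rule): each weight is scaled by exp(+/- eta), so the fractional
    change per step is bounded by eta regardless of the gradient's scale."""
    return w * np.exp(-eta * np.sign(w * g))

w = np.array([1.0, -0.5, 2.0])
g = np.array([0.3, -0.1, 0.0])

# Under the multiplicative rule, |w_new / w| is exp(-eta), 1, or exp(eta):
# the same eta value means something structurally different than in SGD.
ratios = np.abs(multiplicative_step(w, g, 0.01) / w)

# A denser sweep for the left-hand plot of Figure 1 could use e.g.:
etas = np.logspace(-4, 0, 25)
```

This is why sweeping the three algorithms over an identical eta grid, as Figure 1 appears to do, may not place them on a comparable footing.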
Jan-26-2025, 22:11:22 GMT