Reviews: Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations
–Neural Information Processing Systems
This seems to be a worthwhile goal (since plain RNNs are computationally cheaper and easier to analyze theoretically), and the experiments show promising improvements over plain RNNs trained with existing optimization methods. However, it is not clear to me how the method the authors use in practice differs significantly from the regular Path-SGD introduced in previous work. The authors do adapt Path-SGD to networks with shared weights, and they show that the new rescaling term applied to the gradients can be split into two terms, k1 and k2. But they then note that the second term, which accounts for interactions between shared weights along the same path, is expensive to compute for RNNs, and they present some empirical evidence that including it does not help performance. In the remaining experiments they drop the second term, which, to my understanding, is essentially what distinguishes the method introduced here from regular Path-SGD.
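For reference, the baseline the review compares against can be sketched roughly as follows. This is a minimal illustration of regular Path-SGD rescaling for a small feedforward ReLU network without shared weights, assuming the squared (p = 2) path norm. It does not implement the paper's shared-weight terms k1 and k2. All function names and the weight layout (`weights[l][i][j]` connects unit `j` of layer `l` to unit `i` of layer `l + 1`) are assumptions made for this example, as is the `eps` guard against division by zero.

```python
# Illustrative sketch only: regular Path-SGD scaling for a feedforward net
# with NO shared weights. Names and data layout are assumptions, not the
# authors' implementation.

def matvec_sq(W, v):
    """Multiply the elementwise-squared matrix W by the vector v."""
    return [sum(w * w * x for w, x in zip(row, v)) for row in W]


def matvec_sq_T(W, v):
    """Multiply the transpose of the elementwise-squared matrix W by v."""
    return [sum(W[i][j] ** 2 * v[i] for i in range(len(W)))
            for j in range(len(W[0]))]


def path_sgd_scaling(weights):
    """Return (kappas, path_norm_sq) for a list of layer weight matrices.

    kappas[l][i][j] is the sum, over all input-to-output paths through the
    edge weights[l][i][j], of the product of the other squared weights.
    """
    # gamma_in[l][j]: squared path norm from the inputs to unit j of layer l.
    gamma_in = [[1.0] * len(weights[0][0])]
    for W in weights:
        gamma_in.append(matvec_sq(W, gamma_in[-1]))

    # gamma_out[l][i]: squared path norm from unit i of layer l to the outputs.
    gamma_out = [[1.0] * len(weights[-1])]
    for W in reversed(weights):
        gamma_out.insert(0, matvec_sq_T(W, gamma_out[0]))

    kappas = [
        [[gamma_out[l + 1][i] * gamma_in[l][j] for j in range(len(W[0]))]
         for i in range(len(W))]
        for l, W in enumerate(weights)
    ]
    return kappas, sum(gamma_in[-1])


def path_sgd_step(weights, grads, lr, eps=1e-12):
    """One update w <- w - lr * grad / kappa (eps is an assumed safeguard)."""
    kappas, _ = path_sgd_scaling(weights)
    return [
        [[w - lr * g / max(k, eps) for w, g, k in zip(w_row, g_row, k_row)]
         for w_row, g_row, k_row in zip(W, G, K)]
        for W, G, K in zip(weights, grads, kappas)
    ]
```

In this unshared setting every path uses each weight at most once, so the scaling reduces to a single product of an incoming and an outgoing path norm. The interaction term the review calls k2 only arises when a shared weight can appear more than once on the same path, as in an RNN unrolled over time.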
Jan-20-2025, 13:24:53 GMT