Reviews: Can SGD Learn Recurrent Neural Networks with Provable Generalization?

Neural Information Processing Systems 

This paper shows that Elman RNNs optimized with vanilla SGD can learn concept classes in which the target output at each position of the sequence is any function of the previous L inputs that can be encoded by a two-layer neural network with smooth activations. The main result rests on multiple assumptions and involves several technical complications. The crux of the proof is to show that if the RNN is sufficiently overparameterized, then starting from a randomly initialized recurrent matrix W, there exists a function that is linear in W whose value at a specific point W* near the initialization is a good approximation to the target in the concept class. Showing that SGD moves in a direction similar to such a W* gives the desired result. Another interesting aspect of the main result is that the number of samples SGD needs depends only logarithmically on the number of RNN neurons, making the result applicable to overparameterized settings.
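To make the setting concrete, here is a minimal numpy sketch of the Elman recurrence and one vanilla SGD step on a squared loss at the final position. The recurrence h_t = relu(W h_{t-1} + A x_t), y_t = B h_t is the standard Elman RNN; the names W, A, B, the initialization scales, and the learning rate are illustrative choices, not the paper's exact parameterization, and for brevity the gradient step shown updates only the output layer rather than backpropagating through time.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, T, eta = 64, 8, 5, 1e-2  # width, input dim, sequence length, step size

# Randomly initialized weights (scales are illustrative assumptions)
W = rng.normal(0, 1 / np.sqrt(m), (m, m))  # recurrent matrix
A = rng.normal(0, 1 / np.sqrt(m), (m, d))  # input matrix
B = rng.normal(0, 1 / np.sqrt(m), (1, m))  # output matrix

def forward(xs):
    """Run the Elman recurrence; return hidden states and outputs."""
    h = np.zeros(m)
    hs, ys = [], []
    for x in xs:
        h = np.maximum(W @ h + A @ x, 0.0)  # h_t = relu(W h_{t-1} + A x_t)
        hs.append(h)
        ys.append(B @ h)                    # y_t = B h_t
    return hs, ys

# One SGD step on the squared loss at the last position (output layer only)
xs = [rng.normal(size=d) for _ in range(T)]
target = 1.0
hs, ys = forward(xs)
err = ys[-1].item() - target
B -= eta * err * hs[-1][None, :]  # gradient of 0.5 * err**2 w.r.t. B
```

After this step the squared error on the same sequence shrinks by the factor (1 - eta * ||h_T||^2)^2, since the loss is quadratic in B with the hidden states fixed.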