[R] 2 Hr. Talk "Information Theory of Deep Learning" (Naftali Tishby) • r/MachineLearning
This two-hour talk goes into detail on many of the points people were interested in from his original talk. Questions, in order: What is the difference between the noise in SGD and typical Langevin dynamics? How does the theory deal with saturated gradients? Is it a reasonable strategy to perform early stopping specifically before the compression phase? Have you tried the framework on ResNets? What is the message for practitioners? How does the performance of dropout/regularizers relate to the theory? Have you considered the connection to a Fermi gas equilibrium?
Oct-12-2017, 06:20:44 GMT