Reviews: Is Deeper Better only when Shallow is Good?
–Neural Information Processing Systems
This is a good paper that suggests excellent directions for new work. The key point is captured in this statement: "we conjecture that a distribution which cannot be approximated by a shallow network cannot be learned using a gradient-based algorithm, even when using a deep architecture." The authors provide first steps towards investigating this claim. There has been a small amount of work on the typical expressivity of neural networks, in addition to the "worst-case approach." See the papers "Complexity of linear regions in deep networks" and "Deep ReLU Networks Have Surprisingly Few Activation Patterns" by Hanin and Rolnick, which prove that while the number of linear regions can be made to grow exponentially with the depth, the typical number of linear regions is much smaller. See also "Do deep nets really need to be deep?" by Ba and Caruana, which indicates that once deep networks have learned a function, shallow networks can often be trained to distill the deep networks without appreciable performance loss.
Neural Information Processing Systems
Jan-24-2025, 04:14:09 GMT
- Technology: