Failures of Gradient-Based Deep Learning
Shalev-Shwartz, Shai; Shamir, Ohad; Shammah, Shaked
The success stories of deep learning form an ever-lengthening list of practical breakthroughs and state-of-the-art performances, spanning the fields of computer vision [23, 14, 25, 33], audio and natural language processing and generation [5, 15, 11, 34], as well as robotics [24, 26], to name just a few. The list of success stories can be matched and surpassed by a list of practical "tips and tricks", including different optimization algorithms, parameter tuning methods [30, 22], initialization schemes [10], architecture designs [31], loss functions, data augmentation [23], and so on. The current theoretical understanding of deep learning is far from sufficient for a rigorous analysis of the difficulties faced by practitioners. Progress must come from both sides: when practitioners emphasize the difficulties they face, they give theoreticians practical insights; theoreticians, in turn, supply theoretical insights and guarantees that further strengthen and sharpen practical intuition and wisdom. In particular, understanding the failures of existing algorithms is as important as understanding where they succeed. Our goal in this paper is to present and discuss families of simple problems for which commonly used methods do not perform as well as one might expect. We use empirical results and insights as the ground on which to build a theoretical analysis characterizing the sources of failure. These insights are aligned with, and sometimes lead to, different approaches, whether in the architecture, the loss function, or the optimization scheme, and explain their superiority when applied to members of those families. Interestingly, the sources of failure in our experiments do not seem to relate to stationary point issues such as spurious local minima or a plethora of saddle points, a topic of much recent interest (e.g.
Apr-26-2017