A Proof of Lemma
According to [5], $G$ has a closed graph and compact values. Furthermore, it holds that $G(t) \subseteq D(t)$ for all $t \in \mathbb{R}$. Adopting the terminology from [5], $D$ is conservative for the ReLU function, which implies that $G$ is conservative for the ReLU function as well [5, Remark 3(e)]. Indeed, the Clarke subdifferential is the convex hull of limits of sequences of gradients. For the Lipschitz constant we want the element of maximum norm, which is necessarily attained at an extreme point (a corner) of this convex hull, so for our purposes it suffices to consider such sequences. Since the ReLU network is almost everywhere differentiable, we can take a shrinking sequence of balls around any given point and obtain gradients arbitrarily close to any corner of the subdifferential at that point. The norms of these gradients therefore converge to the corner's norm, so it suffices to optimize over differentiable points, and the value chosen at points of nondifferentiability does not matter.
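For concreteness, the identities used in this step can be written as follows (a sketch; here $f$ denotes the ReLU network and $D_f$ the full-measure set on which $f$ is differentiable, notation introduced only for this illustration and not taken from [5]):
\[
\partial f(x) \;=\; \operatorname{conv}\Bigl\{\, \lim_{k\to\infty} \nabla f(x_k) \;:\; x_k \to x,\ x_k \in D_f \,\Bigr\},
\qquad
\sup_{x}\ \max_{g \in \partial f(x)} \lVert g \rVert
\;=\; \sup_{x \in D_f} \lVert \nabla f(x) \rVert .
\]
Since the norm is convex, its maximum over the convex hull is attained at an extreme point, i.e. at a limit of gradients evaluated along differentiable points, which is why the supremum is unaffected by the values assigned at nondifferentiable points.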