Locally Private Learning without Interaction Requires Separation
Amit Daniely, Vitaly Feldman
We consider learning under the constraint of local differential privacy (LDP). For many learning problems, the known efficient algorithms in this model require many rounds of communication between the server and the clients holding the data points. Yet multi-round protocols are prohibitively slow in practice due to network latency, and, as a result, currently deployed large-scale systems are limited to a single round. Despite significant research interest, very little is known about which learning problems can be solved by such non-interactive systems. The only lower bound we are aware of is for PAC learning an artificial class of functions with respect to a uniform distribution [39].
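To make the non-interactive setting concrete, here is a minimal Python sketch of randomized response, the textbook single-round LDP primitive in which each client sends exactly one privatized message and the server never queries them again. It only illustrates the model; it is unrelated to the paper's construction or lower bound, and the epsilon value and bit-valued data are arbitrary placeholders.

```python
import numpy as np

def randomized_response(bit, epsilon, rng):
    """eps-LDP randomizer for one bit: report it truthfully with
    probability e^eps / (e^eps + 1), otherwise flip it."""
    keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < keep else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased estimate of the true mean recovered from the noisy reports."""
    keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1.0 - keep)) / (2.0 * keep - 1.0)

rng = np.random.default_rng(0)
true_bits = rng.integers(0, 2, size=100_000)        # clients' private bits
# Each client sends a single message; there are no further rounds.
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
print(true_bits.mean(), debiased_mean(reports, 1.0))
```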
Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
Amit Daniely, Roy Frostig, Yoram Singer
We develop a general duality between neural networks and compositional kernel Hilbert spaces. We introduce the notion of a computation skeleton, an acyclic graph that succinctly describes both a family of neural networks and a kernel space. Random neural networks are generated from a skeleton through node replication followed by sampling from a normal distribution to assign weights. The kernel space consists of functions that arise by compositions, averaging, and non-linear transformations governed by the skeleton's graph topology and activation functions. We prove that random networks induce representations which approximate the kernel space. In particular, it follows that random weight initialization often yields a favorable starting point for optimization despite the worst-case intractability of training neural networks.
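As a numerical illustration of the dual view for the simplest case of a single fully connected ReLU layer, the sketch below checks that inner products of random ReLU representations concentrate around a closed-form kernel (the degree-1 arc-cosine kernel of Cho and Saul, which is the dual kernel of ReLU on the unit sphere). This is only a sanity check of the single-layer statement, not the general skeleton construction; the dimensions are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 200_000                      # input dimension, number of random hidden units
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)

# One random ReLU layer: rows of W are i.i.d. N(0, I_d) weight vectors.
W = rng.standard_normal((m, d))
relu = lambda z: np.maximum(z, 0.0)
phi_x, phi_y = relu(W @ x), relu(W @ y)

# Empirical kernel induced by the random representation.
k_emp = phi_x @ phi_y / m

# Closed-form dual kernel of ReLU on the unit sphere
# (degree-1 arc-cosine kernel, Cho and Saul 2009).
theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
k_dual = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

print(k_emp, k_dual)                    # agree up to O(1/sqrt(m)) sampling noise
```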
SGD Learns the Conjugate Kernel Class of the Network
Amit Daniely
While stochastic gradient descent (SGD) from a random initialization is probably the most popular supervised learning algorithm today, we have very few results that depict conditions guaranteeing its success. Indeed, to the best of our knowledge, Andoni et al. [2014] provide the only known result of this form, and it holds in a rather restricted setting: depth-2 networks, an underlying Gaussian distribution, full gradient descent (rather than SGD), and a regression task in which the learned function is a constant-degree polynomial. We build on the framework of Daniely et al. [2016] to establish guarantees for SGD in a rather general setting. Daniely et al. [2016] defined a framework that associates a reproducing kernel with every network architecture.
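A minimal sketch of the phenomenon behind the title, under strong simplifications: the hidden weights are frozen at a random initialization and SGD is run only on the readout layer with the hinge loss, so the network learns a linear predictor over its conjugate-kernel (random ReLU) features. The paper's actual guarantee concerns SGD on all weights, which this does not reproduce; the data, dimensions, and learning rate are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n, lr, epochs = 20, 512, 2000, 0.5, 30

# Synthetic binary labels from a random linear rule (illustrative placeholder data).
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X @ rng.standard_normal(d))

# Random first layer, frozen at initialization; its ReLU outputs are the
# conjugate-kernel features of the architecture.
W = rng.standard_normal((m, d))
features = np.maximum(X @ W.T, 0.0) / np.sqrt(m)

# Plain SGD with the hinge loss, applied only to the readout weights v.
v = np.zeros(m)
for _ in range(epochs):
    for i in rng.permutation(n):
        if y[i] * (features[i] @ v) < 1.0:      # hinge-loss subgradient step
            v += lr * y[i] * features[i]
print("train accuracy:", np.mean(np.sign(features @ v) == y))
```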