to

### Training Quantized Neural Networks to Global Optimality via Semidefinite Programming

In this paper we focus on training quantized neural networks for efficient machine learning models. We consider the combinatorial, non-convex problem of minimizing empirical error with respect to quantized weights. We focus on polynomial activation functions, for which the training problem is still non-trivial to solve. Recent work has shown that two-layer neural networks with ReLU [35, 36] and leaky ReLU activations [25] can be trained via convex optimization in polynomial time with respect to the number of samples and neurons. Moreover, networks with degree-two polynomial activations can be trained to global optimality in polynomial time with respect to all problem dimensions using semidefinite programming [7]. In this work, we take a similar convex duality approach that involves semidefinite programming. However, our method and theoretical analysis are substantially different since we consider quantized weights, which leads to a discrete non-convex optimization problem. Because the first-layer weights are discrete, the problem is combinatorial and NP-hard, and thus we cannot hope to obtain results similar to those in [7] or [35].

### Convex Regularization Behind Neural Reconstruction

Neural networks have shown tremendous potential for reconstructing high-resolution images in inverse problems. The non-convex and opaque nature of neural networks, however, hinders their utility in sensitive applications such as medical imaging. To cope with this challenge, this paper advocates a convex duality framework that makes a two-layer fully-convolutional ReLU denoising network amenable to convex optimization. The convex dual network not only enables optimal training with convex solvers, but also facilitates interpreting training and prediction. In particular, it implies that training neural networks with weight decay regularization induces path sparsity, while prediction amounts to piecewise linear filtering. A range of experiments with the MNIST and fastMRI datasets confirms the efficacy of the dual network optimization problem. In the age of AI, image reconstruction has witnessed a paradigm shift that impacts several applications ranging from natural image super-resolution to medical imaging.

### Adaptive Iterative Hessian Sketch via A-Optimal Subsampling

Iterative Hessian sketch (IHS) is an effective sketching method for modeling large-scale data. It was originally proposed by Pilanci and Wainwright (2016; JMLR) based on randomized sketching matrices. However, it is computationally intensive due to the iterative sketching process. In this paper, we analyze the IHS algorithm in the unconstrained least squares setting, then propose a deterministic approach for improving IHS via A-optimal subsampling. Our contributions are three-fold: (1) a good initial estimator based on the A-optimal design is suggested; (2) a novel ridged preconditioner is developed for repeated sketching; and (3) an exact line search method is proposed for determining the optimal step length adaptively. Extensive experimental results demonstrate that our proposed A-optimal IHS algorithm outperforms the existing accelerated IHS methods.
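
To make the iteration concrete, the following is a minimal sketch of a generic IHS loop for unconstrained least squares with an exact line search. It is not the A-optimal method described above: as a stand-in assumption it draws a fresh Gaussian sketch at every iteration and starts from zero, and the function name `ihs_exact_line_search` and its parameters are hypothetical. For the objective $\frac{1}{2}\|Ax-b\|_2^2$ with gradient $g$ and search direction $p$, the exact step length is $-g^\top p / \|Ap\|_2^2$.

```python
import numpy as np


def ihs_exact_line_search(A, b, sketch_size, n_iters=20, tol=1e-10, seed=0):
    """Generic IHS for min_x 0.5 * ||Ax - b||^2 with exact line search.

    Assumption: Gaussian sketches stand in for the A-optimal subsampling
    of the paper, and the initial estimate is simply zero.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)  # exact gradient of the objective
        if np.linalg.norm(grad) <= tol:
            break
        # Sketched Hessian (SA)^T (SA) approximates A^T A.
        S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
        SA = S @ A
        direction = -np.linalg.solve(SA.T @ SA, grad)
        # Exact minimizer of the quadratic objective along `direction`.
        A_dir = A @ direction
        step = -(grad @ direction) / (A_dir @ A_dir)
        x = x + step * direction
    return x
```

Because the objective is quadratic, the line search has a closed form and costs only one extra product with `A` per iteration, which is why an adaptive step length is cheap in this setting.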

### Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks

Despite several attempts, the fundamental mechanisms behind the success of deep neural networks remain elusive. To this end, we introduce a novel analytic framework to unveil hidden convexity in training deep neural networks. We consider a parallel architecture with multiple ReLU sub-networks, which includes many standard deep architectures and ResNets as special cases. We then show that the training problem with path regularization can be cast as a single convex optimization problem in a high-dimensional space. We further prove that the equivalent convex program is regularized via a group sparsity inducing norm. Thus, a path regularized parallel architecture with ReLU sub-networks can be viewed as a parsimonious feature selection method in high dimensions. More importantly, we show that the computational complexity required to globally optimize the equivalent convex problem is polynomial in the number of data samples and the feature dimension. Therefore, we prove exact polynomial-time trainability for path regularized deep ReLU networks with global optimality guarantees. We also provide several numerical experiments corroborating our theory.

### Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, which leads to both theoretical and empirical improvements in the convergence rate of distributed second order methods. Our technique has two novel components: (1) modifying standard sketching techniques to obtain what we call a surrogate sketch; and (2) carefully scaling the global regularization parameter for local computations. Our surrogate sketches are based on determinantal point processes, a family of distributions for which the bias of an estimate of the inverse Hessian can be computed exactly. Based on this computation, we show that when the objective being minimized is $\ell_2$-regularized with parameter $\lambda$ and individual machines are each given a sketch of size $m$, then to eliminate the bias, local estimates should be computed using a shrunk regularization parameter given by $\lambda^{\prime}=\lambda\cdot(1-\frac{d_{\lambda}}{m})$, where $d_{\lambda}$ is the $\lambda$-effective dimension of the Hessian (or, for quadratic problems, the data matrix).
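
The scaling rule above is simple to evaluate once $d_{\lambda}$ is known. The sketch below assumes the standard definition $d_{\lambda}=\operatorname{tr}\big(H(H+\lambda I)^{-1}\big)=\sum_i \frac{h_i}{h_i+\lambda}$, where $h_i$ are the eigenvalues of the Hessian $H$; the abstract does not spell this definition out, so treat it as an assumption. The function names are hypothetical, and the eigenvalues are taken as given rather than computed.

```python
def effective_dimension(hessian_eigenvalues, lam):
    """lambda-effective dimension: sum of h / (h + lam) over eigenvalues h.

    Assumption: this is the standard definition tr(H (H + lam I)^{-1}).
    """
    return sum(h / (h + lam) for h in hessian_eigenvalues)


def shrunk_regularizer(hessian_eigenvalues, lam, sketch_size):
    """Local parameter lam' = lam * (1 - d_lam / m) for sketch size m."""
    d_lam = effective_dimension(hessian_eigenvalues, lam)
    if sketch_size <= d_lam:
        # The formula would give a non-positive regularizer in this regime.
        raise ValueError("sketch_size must exceed the effective dimension")
    return lam * (1.0 - d_lam / sketch_size)


# Usage: four unit eigenvalues with lam = 1 give d_lam = 2, so a sketch of
# size m = 10 shrinks the regularizer by a factor of 1 - 2/10.
local_lam = shrunk_regularizer([1.0, 1.0, 1.0, 1.0], lam=1.0, sketch_size=10)
```

Since $d_{\lambda}$ never exceeds the number of eigenvalues and decreases as $\lambda$ grows, the shrinkage is mild whenever the sketch size $m$ is large relative to the effective dimension.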