Country
ProGraML: Graph-based Deep Learning for Program Optimization and Analysis
Cummins, Chris, Fisches, Zacharias V., Ben-Nun, Tal, Hoefler, Torsten, Leather, Hugh
The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation. We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low level, language agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks. ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.
Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena
Emschwiller, Matt, Gamarnik, David, Kızıldağ, Eren C., Zadik, Ilias
In the context of neural network models, overparametrization refers to the phenomena whereby these models appear to generalize well on the unseen data, even though the number of parameters significantly exceeds the sample sizes, and the model perfectly fits the in-training data. A conventional explanation of this phenomena is based on self-regularization properties of algorithms used to train the data. In this paper we prove a series of results which provide a somewhat diverging explanation. Adopting a teacher/student model where the teacher network is used to generate the predictions and student network is trained on the observed labeled data, and then tested on out-of-sample data, we show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension and approximation guarantee alone, regardless of the number of internal nodes of either teacher or student network. Our claim is based on approximating both teacher and student networks by polynomial (tensor) regression models with degree depending on the desired accuracy and network depth only. Such a parametrization notably does not depend on the number of internal nodes. Thus a message implied by our results is that parametrizing wide neural networks by the number of hidden nodes is misleading, and a more fitting measure of parametrization complexity is the number of regression coefficients associated with tensorized data. In particular, this somewhat reconciles the generalization ability of neural networks with more classical statistical notions of data complexity and generalization bounds. Our empirical results on MNIST and Fashion-MNIST datasets indeed confirm that tensorized regression achieves a good out-of-sample performance, even when the degree of the tensor is at most two.
Creating Synthetic Datasets via Evolution for Neural Program Synthesis
Program synthesis is the task of automatically generating a program consistent with a given specification. A natural way to specify programs is to provide examples of desired input-output behavior, and many current program synthesis approaches have achieved impressive results after training on randomly generated input-output examples. However, recent work has discovered that some of these approaches generalize poorly to data distributions different from that of the randomly generated examples. We show that this problem applies to other state-of-the-art approaches as well and that current methods to counteract this problem are insufficient. We then propose a new, adversarial approach to control the bias of synthetic data distributions and show that it outperforms current approaches.
Efficient Tensor Kernel methods for sparse regression
Hibraj, Feliks, Pelillo, Marcello, Salzo, Saverio, Pontil, Massimiliano
Recently, classical kernel methods have been extended by the introduction of suitable tensor kernels so to promote sparsity in the solution of the underlying regression problem. Indeed, they solve an lp-norm regularization problem, with p=m/(m-1) and m even integer, which happens to be close to a lasso problem. However, a major drawback of the method is that storing tensors requires a considerable amount of memory, ultimately limiting its applicability. In this work we address this problem by proposing two advances. First, we directly reduce the memory requirement, by intriducing a new and more efficient layout for storing the data. Second, we use a Nystrom-type subsampling approach, which allows for a training phase with a smaller number of data points, so to reduce the computational cost. Experiments, both on synthetic and read datasets, show the effectiveness of the proposed improvements. Finally, we take case of implementing the cose in C++ so to further speed-up the computation.
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
Koloskova, Anastasia, Loizou, Nicolas, Boreiri, Sadra, Jaggi, Martin, Stich, Sebastian U.
Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per iteration cost, data locality, and their communication-efficiency. In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities. Our algorithmic framework covers local SGD updates and synchronous and pairwise gossip updates on adaptive network topology. We derive universal convergence rates for smooth (convex and non-convex) problems and the rates interpolate between the heterogeneous (non-identically distributed data) and iid-data settings, recovering linear convergence rates in many special cases, for instance for over-parametrized models. Our proofs rely on weak assumptions (typically improving over prior work in several aspects) and recover (and improve) the best known complexity results for a host of important scenarios, such as for instance coorperative SGD and federated averaging (local SGD).
A classification for the performance of online SGD for high-dimensional inference
Arous, Gerard Ben, Gheissari, Reza, Jagannath, Aukosh
Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from a large number of independent samples of data by iteratively optimizing a loss function. This loss function is high-dimensional, random, and often complex. We study here the performance of the simplest version of SGD, namely online SGD, in the initial "search" phase, where the algorithm is far from a trust region and the loss landscape is highly non-convex. To this end, we investigate the performance of online SGD at attaining a "better than random" correlation with the unknown parameter, i.e, achieving weak recovery. Our contribution is a classification of the difficulty of typical instances of this task for online SGD in terms of the number of samples required as the dimension diverges. This classification depends only on an intrinsic property of the population loss, which we call the information exponent. Using the information exponent, we find that there are three distinct regimes---the easy, critical, and difficult regimes---where one requires linear, quasilinear, and polynomially many samples (in the dimension) respectively to achieve weak recovery. We illustrate our approach by applying it to a wide variety of estimation tasks such as parameter estimation for generalized linear models, two-component Gaussian mixture models, phase retrieval, and spiked matrix and tensor models, as well as supervised learning for single-layer networks with general activation functions. In this latter case, our results translate into a classification of the difficulty of this task in terms of the Hermite decomposition of the activation function.
Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting
Wu, Lemeng, Ye, Mao, Lei, Qi, Lee, Jason D., Liu, Qiang
We propose signed splitting steepest descent (S3D), which progressively grows neural architectures by splitting critical neurons into multiple copies, following a theoretically-derived optimal scheme. Our algorithm is a generalization of the splitting steepest descent (S2D) of Liu et al. (2019b), but significantly improves over it by incorporating a rich set of new splitting schemes that allow negative output weights. By doing so, we can escape local optima that the original S2D can not escape. Theoretically, we show that our method provably learns neural networks with much smaller sizes than these needed for standard gradient descent in overparameterized regimes. Empirically, our method outperforms S2D and prior arts on various challenging benchmarks, including CIFAR-100, ImageNet and ModelNet40.
Diffusion-based Deep Active Learning
The remarkable performance of deep neural networks depends on the availability of massive labeled data. To alleviate the load of data annotation, active deep learning aims to select a minimal set of training points to be labelled which yields maximal model accuracy. Most existing approaches implement either an `exploration'-type selection criterion, which aims at exploring the joint distribution of data and labels, or a `refinement'-type criterion which aims at localizing the detected decision boundaries. We propose a versatile and efficient criterion that automatically switches from exploration to refinement when the distribution has been sufficiently mapped. Our criterion relies on a process of diffusing the existing label information over a graph constructed from the hidden representation of the data set as provided by the neural network. This graph representation captures the intrinsic geometry of the approximated labeling function. The diffusion-based criterion is shown to be advantageous as it outperforms existing criteria for deep active learning.
A termination criterion for stochastic gradient descent for binary classification
Baghal, Sina, Paquette, Courtney, Vavasis, Stephen A.
Here the loss function l: R R R, the probability distribution P is unknown, and the data sample (ζ,y) R d R is a random vector distributed as P. The most prevalent algorithm employed for solving(1) is stochastic gradient descent (SGD). Whereas a significant amount of work has been devoted to the convergence analysis of SGD (see, e.g., Robbins and Monro (1951); Bottou et al. (2018); Bubeck (2015); Pflug (1986)), leading, in particular, to learning rate schedules, the question of how to terminate the algorithm when one is near an optimal classifier remains largely unaddressed. Yet, inexpensive stopping criteria are of utmost interest in machine learning. For instance, if one could produce a low cost test to determine near-optimality, then without sacrificing the quality of the solution or efficiency of the SGD algorithm, needless computational time would be eliminated. Secondly, early termination tests impose a degree of predictability on accuracy and running times-a useful quality when SGD occurs as a subproblem of a larger computation. Several works show that early termination of SGD can prevent overfitting, speed up learning procedures, and/or improve generalization properties (Prechelt, 2012; Hardt et al., 2016; Yao et al., 2007).
Graph Neural Networks for Decentralized Controllers
Gama, Fernando, Tolstaya, Ekaterina, Ribeiro, Alejandro
Dynamical systems comprised of autonomous agents arise in many relevant problems such as multi-agent robotics, smart grids, or smart cities. Controlling these systems is of paramount importance to guarantee a successful deployment. Optimal centralized controllers are readily available but face limitations in terms of scalability and practical implementation. Optimal decentralized controllers, on the other hand, are difficult to find. In this paper, we use graph neural networks (GNNs) to learn decentralized controllers from data. GNNs are well-suited for the task since they are naturally distributed architectures. Furthermore, they are equivariant and stable, leading to good scalability and transferability properties. The problem of flocking is explored to illustrate the power of GNNs in learning decentralized controllers.