AITopics

2010.03648

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Machine LearningOct-6-2020

InstaHide: Instance-hiding Schemes for Private Distributed Learning

Huang, Yangsibo, Song, Zhao, Li, Kai, Arora, Sanjeev

How can multiple distributed entities collaboratively train a shared deep net on their private data while preserving privacy? This paper introduces InstaHide, a simple encryption of training images, which can be plugged into existing distributed deep learning pipelines. The encryption is efficient and applying it during training has minor effect on test accuracy. InstaHide encrypts each training image with a "one-time secret key" which consists of mixing a number of randomly chosen images and applying a random pixel-wise mask. Other contributions of this paper include: (a) Using a large public dataset (e.g. ImageNet) for mixing during its encryption, which improves security. (b) Experimental results to show effectiveness in preserving privacy against known attacks with only minor effects on accuracy. (c) Theoretical analysis showing that successfully attacking privacy requires attackers to solve a difficult computational problem. (d) Demonstrating that use of the pixel-wise mask is important for security, since Mixup alone is shown to be insecure to some some efficient attacks. (e) Release of a challenge dataset https://github.com/Hazelsuko07/InstaHide_Challenge Our code is available at https://github.com/Hazelsuko07/InstaHide

deep learning, instahide, neural network, (22 more...)

2010.02772

Country:

North America > United States > Massachusetts (0.14)
North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(2 more...)

arXiv.org Machine LearningOct-27-2019

Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Arora, Sanjeev, Du, Simon S., Li, Zhiyuan, Salakhutdinov, Ruslan, Wang, Ruosong, Yu, Dingli

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et al. (2019a), which allowed studying performance of infinitely wide nets on datasets like CIFAR-10. However, super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks. 1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets. 2. On CIFAR-10 with 10 - 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% - 3%. 3. On VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance. 4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis(Arora et al., 2019a). NTK's efficacy may trace to lower variance of output.

artificial intelligence, dataset, neural network, (16 more...)

1910.01663

Country: North America (0.46)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

arXiv.org Machine LearningOct-17-2019

An Exponential Learning Rate Schedule for Deep Learning

Li, Zhiyuan, Arora, Sanjeev

A BSTRACT Intriguing empirical evidence exists that deep learning can work well with exotic schedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization or BN(Ioffe & Szegedy, 2015), which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following new results are shown about BN with weight decay and momentum (in other words, the typical use case which was not considered in earlier theoretical analyses of stand-alone BN (Ioffe & Szegedy, 2015; Santurkar et al., 2018; Arora et al., 2018) - Training can be done using SGD with momentum and an exponentially increasing learning rate schedule, i.e., learning rate increases by some (1 α) factor in every epoch for some α 0. (Precise statement in the paper.) To the best of our knowledge this is the first time such a rate schedule has been successfully used, let alone for highly successful architectures. As expected, such training rapidly blows up network weights, but the net stays well-behaved due to normalization. This equivalence holds for other normalization layers as well, Group Normalization(Wu & He, 2018), Layer Normalization(Ba et al., 2016), Instance Norm(Ulyanov et al., 2016), etc. - A worked-out toy example illustrating the above linkage of hyper-parameters. Using either weight decay or BN alone reaches global minimum, but convergence fails when both are used. Usually best performance is attained by adding weight decay and momentum in addition to BN. Usually weight decay is thought to improve generalization by controlling the norm of the parameters. However, it is fallacious to try to separately think of optimization and generalization because we are dealing with a nonconvex objective with multiple optima. Even slight changes to the training surely lead to a different trajectory in the loss landscape, potentially ending up at a different solution! One needs trajectory analysis to have a hope of reasoning about the effects of such changes. In the presence of BN and other normalization schemes, including GroupNorm, LayerNorm, and InstanceNorm, the optimization objective is scale invariant to the parameters, which means rescaling parameters would not change the prediction, except the parameters that compute the output which do not have BN.

computer based training, deep learning, null, (21 more...)

1910.07454

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

arXiv.org Machine LearningJun-14-2019

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

Kuditipudi, Rohith, Wang, Xiang, Lee, Holden, Zhang, Yi, Li, Zhiyuan, Hu, Wei, Arora, Sanjeev, Ge, Rong

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima---at least those discovered by gradient-based optimization---turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.

artificial intelligence, neural network, probability, (16 more...)

1906.06247

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial IntelligenceJun-6-2019

A Simple Saliency Method That Passes the Sanity Checks

Gupta, Arushi, Arora, Sanjeev

There is great interest in "saliency methods" (also called "attribution methods"), which give "explanations" for a deep net's decision, by assigning a "score" to each feature/pixel in the input. Their design usually involves credit-assignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned the validity of many of these methods since they do not pass simple *sanity checks* which test whether the scores shift/vanish when layers of the trained net are randomized, or when the net is retrained using random labels for inputs. We propose a simple fix to existing saliency methods that helps them pass sanity checks, which we call "competition for pixels". This involves computing saliency maps for all possible labels in the classification task, and using a simple competition among them to identify and remove less relevant pixels from the map. The simplest variant of this is "Competitive Gradient $\odot$ Input (CGI)": it is efficient, requires no additional training, and uses only the input and gradient. Some theoretical justification is provided for it (especially for ReLU networks) and its performance is empirically demonstrated.

artificial intelligence, neural network, saliency map, (18 more...)

1905.12152

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial IntelligenceJun-4-2019

Implicit Regularization in Deep Matrix Factorization

Arora, Sanjeev, Cohen, Nadav, Hu, Wei, Luo, Yuping

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.

deep learning, equation, neural network, (17 more...)

1905.13655

Country: Europe > Belgium (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

arXiv.org Machine LearningApr-26-2019

On Exact Computation with an Infinitely Wide Neural Net

Arora, Sanjeev, Du, Simon S., Hu, Wei, Li, Zhiyuan, Salakhutdinov, Ruslan, Wang, Ruosong

How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. A subsequent paper [Lee et al., 2019] gave heuristic Monte Carlo methods to estimate the NTK and its extension, Convolutional Neural Tangent Kernel (CNTK) and used this to try to understand the limiting behavior on datasets like CIFAR-10. The current paper gives the first efficient exact algorithm (based upon dynamic programming) for computing CNTK as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 5% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). We give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK. Our experiments also demonstrate that earlier Monte Carlo approximation can degrade the performance significantly, thus highlighting the power of our exact kernel computation, which we have applied even to the full CIFAR-10 dataset and 20-layer nets.

deep learning, kernel, neural network, (16 more...)

1904.11955

Country: North America > United States (0.28)

Genre: Research Report (0.81)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceFeb-25-2019

A Theoretical Analysis of Contrastive Unsupervised Representation Learning

Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj

Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.

artificial intelligence, neural network, representation, (16 more...)

1902.09229

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)

Genre: Research Report > Experimental Study (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)

arXiv.org Machine LearningJan-24-2019

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Arora, Sanjeev, Du, Simon S., Hu, Wei, Li, Zhiyuan, Wang, Ruosong

The well-known work of Zhang et al. (2017) highlighted intriguing experimental phenomena about deep net training - specifically, optimization and generalization-and asked whether theory could explain them. They showed that sufficiently powerful nets (with vastly more parameters than number of training samples) can attain zero training error, regardless of whether the data is properly labeled or randomly labeled. Obviously, trainingwith randomly labeled data cannot generalize, whereas training with properly labeled data generalizes. See Figure 2 replicating some of these results. Recent papers have begun to provide explanations, showing that gradient descent can allow an overparametrized multi-layernet to attain arbitrarily low training error on fairly generic datasets (Du et al., 2018a,c; Li & Liang, 2018; Allen-Zhu et al., 2018b; Zou et al., 2018), provided the amount of overparametrization isa high polynomial of the relevant parameters (i.e.

arxiv preprint arxiv, deep learning, neural network, (14 more...)

1901.08584

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)