Goto

Collaborating Authors

 Gradient Descent



8b9e7ab295e87570551db122a04c6f7c-Supplemental.pdf

Neural Information Processing Systems

Neural transport augmented sampling, firstintroduced byParnoandMarzouk (2018),isageneral method for using normalizing flows to sample from a given densityฯ€. Thus, samples can be generated fromฯ€(ฮธ)by running MCMC chain in theZ-space and pushing these samples onto theฮ˜-space usingT. Neural transport augmented samplers havebeen subsequently extended by Hoffman etal. In this paper, we proposed equivariant Stein variational gradient descent algorithm for sampling fromdensities thatareinvarianttosymmetry transformations. Another contributionofourworkis subsequently using this equivariant sampling method to efficiently train equivariant energy based models forprobabilistic modeling andinference.


Adaptive Variance Reduction for Stochastic Optimization under Weaker Assumptions Wei Jiang 1, Sifan Y ang

Neural Information Processing Systems

Problem (1) has been comprehensively investigated in the literature [Duchi et al., 2011, Kingma and Ba, 2015, Loshchilov and Hutter, 2017], and it is well-known that the classical stochastic gradient descent (SGD) achieves a convergence rate of


6fee03d84375a159ecd3769ebbacae83-Supplemental-Conference.pdf

Neural Information Processing Systems

Convergence of stochastic gradient descent for non-smooth problems is a known result. For completeness, wereproduce and adapt ausual proof toour setting. Let us denote byF the class of functions fromX toY we are going to work with. Assumption 1 states that we have a well-specified modelF to estimate the median,i.e. Let us begin by controlling the estimation error.


ActiveLabeling: StreamingStochasticGradients

Neural Information Processing Systems

The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the"activelabeling" problem, whichfocuses onactivelearningwith partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples.


Considerminimizinganempiricalloss min

Neural Information Processing Systems

Many learning tasks, such as regression and classification, are usually framed that way [1]. When N 1, computing the gradient of the objective in(1) becomes a bottleneck, even if individual gradients ฮธL(zi,ฮธ) are cheap to evaluate. For a fixed computational budget, itisthustempting toreplace vanilla gradient descent bymore iterations but using anapproximate gradient, obtained using only afewdata points. Stochastic gradient descent (SGD; [2]) follows this template.



6db3ea527f53682657b3d6b02a841340-Supplemental-Conference.pdf

Neural Information Processing Systems

Westudy theasynchronous stochastic gradient descent algorithm fordistributed training overn workers which have varying computation and communication frequencyovertime.


6db3ea527f53682657b3d6b02a841340-Paper-Conference.pdf

Neural Information Processing Systems

Westudy theasynchronous stochastic gradient descent algorithm fordistributed training overn workers which have varying computation and communication frequencyovertime.