Gradient Descent
8b9e7ab295e87570551db122a04c6f7c-Supplemental.pdf
Neural transport augmented sampling, first introduced by Parno and Marzouk (2018), is a general method for using normalizing flows to sample from a given density π. Thus, samples can be generated from π(θ) by running an MCMC chain in the Z-space and pushing these samples onto the Θ-space using T. Neural transport augmented samplers have been subsequently extended by Hoffman et al. In this paper, we proposed an equivariant Stein variational gradient descent algorithm for sampling from densities that are invariant to symmetry transformations. Another contribution of our work is subsequently using this equivariant sampling method to efficiently train equivariant energy-based models for probabilistic modeling and inference.
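The transport idea above (run MCMC in the Z-space, then push the samples through T) can be illustrated with a minimal one-dimensional sketch. Everything concrete here is an assumption made for illustration: the toy Gaussian target, the fixed affine map standing in for a trained normalizing flow T, and the random-walk Metropolis kernel. None of it is taken from the paper.

```python
import math
import random

def log_target(theta):
    # Unnormalized log-density of the target pi(theta); a toy N(5, 2^2) (assumption).
    return -0.5 * ((theta - 5.0) / 2.0) ** 2

def T(z):
    # Hypothetical stand-in for a trained flow T: Z -> Theta; here a fixed affine map.
    return 5.0 + 2.0 * z

def log_abs_det_jac(z):
    # log |dT/dz| for the affine map above.
    return math.log(2.0)

def log_pullback(z):
    # Target density expressed in Z-space via the change-of-variables formula.
    return log_target(T(z)) + log_abs_det_jac(z)

def sample_via_transport(n_steps, step=1.0, seed=0):
    rng = random.Random(seed)
    z, samples = 0.0, []
    for _ in range(n_steps):
        proposal = z + step * rng.gauss(0.0, 1.0)  # random-walk Metropolis move in Z-space
        log_ratio = log_pullback(proposal) - log_pullback(z)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            z = proposal
        samples.append(T(z))  # push the current state onto the Theta-space
    return samples

thetas = sample_via_transport(20000)
```

With a well-fitted T, the pulled-back density is close to a standard normal, which is the regime where a simple MCMC kernel in the Z-space tends to mix well.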
6fee03d84375a159ecd3769ebbacae83-Supplemental-Conference.pdf
Convergence of stochastic gradient descent for non-smooth problems is a known result. For completeness, we reproduce and adapt a usual proof to our setting. Let us denote by F the class of functions from X to Y we are going to work with. Assumption 1 states that we have a well-specified model F to estimate the median, i.e. Let us begin by controlling the estimation error.
Active Labeling: Streaming Stochastic Gradients
The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples.
Consider minimizing an empirical loss min
Many learning tasks, such as regression and classification, are usually framed that way [1]. When N ≫ 1, computing the gradient of the objective in (1) becomes a bottleneck, even if individual gradients ∇_θ L(z_i, θ) are cheap to evaluate. For a fixed computational budget, it is thus tempting to replace vanilla gradient descent by more iterations, each using an approximate gradient obtained from only a few data points. Stochastic gradient descent (SGD; [2]) follows this template.
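The template just described can be sketched as minibatch SGD. This sketch assumes the objective is the plain average of the individual losses L(z_i, θ) over N data points. The function name, the toy least-squares loss, the uniform sampling without replacement, and the decaying step size are all illustrative assumptions, not details from the paper.

```python
import random

def minibatch_sgd(data, grad, theta0, step_size, batch_size, n_iters, seed=0):
    """Approximately minimize (1/N) * sum_i L(z_i, theta) with minibatch gradients.

    `grad(z, theta)` returns the gradient of the individual loss L(z, theta).
    """
    rng = random.Random(seed)
    theta = theta0
    for t in range(n_iters):
        batch = rng.sample(data, batch_size)  # a few points, drawn uniformly without replacement
        g = sum(grad(z, theta) for z in batch) / batch_size  # unbiased estimate of the full gradient
        theta -= step_size / (1 + t) ** 0.5 * g  # decaying step size (assumption)
    return theta

# Toy problem (assumption): L((x, y), theta) = 0.5 * (theta * x - y) ** 2 with scalar theta.
data_rng = random.Random(1)
data = [(x, 3.0 * x) for x in (data_rng.uniform(-1.0, 1.0) for _ in range(1000))]

def grad_loss(z, theta):
    x, y = z
    return (theta * x - y) * x

theta_hat = minibatch_sgd(data, grad_loss, theta0=0.0, step_size=0.5,
                          batch_size=10, n_iters=2000)
```

Each iteration touches only `batch_size` of the N points, so its cost does not grow with N. That is the trade the paragraph describes: more, cheaper iterations in place of fewer exact ones.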