

Large-batch Optimization for Dense Visual Predictions

Neural Information Processing Systems

At the t-th backward propagation step, we can derive the gradient ∇ℓ_i(w_t) to update the i-th module in M. The number in the bracket represents the batch size. We see that when the batch size is small (i.e., 32), the gradient variances are similar. N and K indicate the number of FPN levels and region proposals fed into the detection head. To evaluate this assumption, as shown in Figure 1, we make three observations. As illustrated by the second panel of Figure 1, the gradient misalignment between the detection head and the backbone has been reduced.
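The batch-size effect described above can be illustrated with a minimal simulation (all numbers and the synthetic per-sample gradients below are hypothetical, not from the paper): averaging per-sample gradients over a larger mini-batch shrinks the variance of the gradient estimate roughly like 1/batch_size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample gradients for one module: mean 1, unit noise.
n_samples, dim = 10_000, 8
per_sample = rng.normal(loc=1.0, scale=1.0, size=(n_samples, dim))

def batch_grad_variance(batch_size, n_batches=2000):
    """Empirical variance of the mini-batch gradient estimate."""
    grads = []
    for _ in range(n_batches):
        idx = rng.integers(0, n_samples, size=batch_size)
        grads.append(per_sample[idx].mean(axis=0))
    return np.var(np.stack(grads), axis=0).mean()

v32, v512 = batch_grad_variance(32), batch_grad_variance(512)
# Variance of the averaged gradient scales roughly as 1/batch_size,
# so the ratio below should be close to 512 / 32 = 16.
print(v32 / v512)
```

This is only a sketch of the variance-vs-batch-size relationship, not a reproduction of the paper's per-module measurements.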


On the Saturation Effects of Spectral Algorithms in Large Dimensions

Neural Information Processing Systems

Many non-parametric regression methods have been proposed to solve the regression problem by assuming that f falls into certain function classes, including polynomial splines (Stone, 1994), local polynomials (Cleveland, 1979; Stone, 1977), the spectral algorithms (Caponnetto, 2006; Caponnetto and De Vito, 2007; Caponnetto and Yao, 2010), etc.



A Near-Optimal Best-of-Both-Worlds Algorithm for Online Learning with Feedback Graphs

Neural Information Processing Systems

We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments. The bound against oblivious adversaries is O(√(αT)), where T is the time horizon and α is the independence number of the feedback graph.
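The independence number α that governs this bound is the size of the largest set of mutually non-adjacent arms in the feedback graph. A minimal brute-force sketch (the 5-cycle graph below is a hypothetical example, not from the paper) makes the quantity concrete:

```python
from itertools import combinations

# Toy feedback graph on 5 arms: a 5-cycle (hypothetical example).
n = 5
edges = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)}

def is_independent(S):
    """True if no two vertices in S share an edge."""
    return all((u, v) not in edges and (v, u) not in edges
               for u, v in combinations(S, 2))

def independence_number(n):
    """Size of the largest independent set (brute force over all subsets)."""
    return max(len(S) for r in range(n + 1)
               for S in combinations(range(n), r) if is_independent(S))

alpha = independence_number(n)
print(alpha)  # a 5-cycle has independence number 2
```

With α = 2 here, the adversarial regret bound would scale like √(2T) rather than the √(5T) of a graph with no feedback edges.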



min

Neural Information Processing Systems

Let A be an n×n Hermitian matrix and let B be an (n−1)×(n−1) matrix constructed by deleting the i-th row and i-th column of A. Denote Φ = [ϕ(x1), ..., ϕ(xn)]⊤ ∈ R^{n×D}, where D is the dimension of the feature space H. Performing rank-n singular value decomposition (SVD) on Φ, we have Φ = HΣV⊤, where H ∈ R^{n×n}, Σ ∈ R^{n×n} is a diagonal matrix whose diagonal elements are the singular values of Φ, and V ∈ R^{D×n}. F(α) in Eq. (21) is proven differentiable, and the p-th component of its gradient is ∂F(α)/∂α_p = … Then, a reduced gradient descent algorithm [26] is adopted to optimize Eq. (21). The three deep neural networks are pre-trained on ImageNet [5].
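The rank-n SVD and its stated shapes can be checked with a small numpy sketch (the sizes n = 4, D = 6 and the random Φ are placeholders for the feature matrix described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 4, 6                      # n samples in a D-dimensional feature space
Phi = rng.normal(size=(n, D))    # stand-in for [phi(x1), ..., phi(xn)]^T

# Thin (rank-n) SVD: Phi = H @ Sigma @ V.T with the shapes from the text.
H, s, Vt = np.linalg.svd(Phi, full_matrices=False)
Sigma, V = np.diag(s), Vt.T

print(H.shape, Sigma.shape, V.shape)      # (4, 4) (4, 4) (6, 4)
print(np.allclose(Phi, H @ Sigma @ V.T))  # exact reconstruction
```

Note that `full_matrices=False` is what yields the n×n left factor and D×n right factor used in the text, rather than the full D×D right singular matrix.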



ff4d5fbbafdf976cfdc032e3bde78de5-Supplemental.pdf

Neural Information Processing Systems

As such, we see that this variance depends on the structure of the density ρ_X, through the variance of (I + λL)^{−1} δ_X, and on the labelling noise, through the variance of (Y | X).


An Injective Change-of-Variable Formula and Stacking Injective Flows

We first derive (5) from (3). By the chain rule, we have: J[g_φ] …

Neural Information Processing Systems

We summarize our methods for computing/estimating the gradient of the log determinant arising in maximum likelihood training of rectangular flows. Algorithm 2 shows the exact method, where jvp(f, z, v) denotes computing J[f](z)v using forward-mode AD, and e_i ∈ R^d is the i-th standard basis vector, i.e. a one-hot vector with a 1 on its i-th coordinate. Note that ∂/∂θ log det A_θ is computed using backpropagation. The for loop is easily parallelized in practice.
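The exact method can be sketched as follows: build the Jacobian column by column with one jvp per basis vector e_i, then take ½ log det(JᵀJ), the rectangular-flow log-determinant term. The toy map f and the central-difference stand-in for a forward-mode jvp below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def f(z):
    """Toy injective map R^2 -> R^3 (hypothetical, for illustration)."""
    return np.array([z[0], z[1], z[0] ** 2 + z[1] ** 2])

def jvp(f, z, v, eps=1e-6):
    """Central-difference stand-in for forward-mode AD: approximates J[f](z) @ v."""
    return (f(z + eps * v) - f(z - eps * v)) / (2 * eps)

def log_det_jtj(f, z, d):
    # The i-th column of J is jvp(f, z, e_i) -- one pass per basis vector.
    J = np.stack([jvp(f, z, np.eye(d)[i]) for i in range(d)], axis=1)
    # Rectangular flows use (1/2) log det(J^T J) in the likelihood.
    sign, logdet = np.linalg.slogdet(J.T @ J)
    return 0.5 * logdet

z = np.array([1.0, 2.0])
# Here J^T J = I + 4 z z^T, so det(J^T J) = 1 + 4||z||^2 = 21.
print(log_det_jtj(f, z, 2), 0.5 * np.log(21.0))
```

In an AD framework the finite-difference `jvp` would be replaced by a true forward-mode Jacobian-vector product, and the d calls in the loop can run in parallel, as the text notes.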