 Gradient Descent






Supplementary File for "Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes"

Neural Information Processing Systems

The supplementary file is organized as follows: Section 1 restates the assumptions and the main theorems on the convergence of the parameter iterates and the full gradient; Section 2 is devoted to the proofs of the two main theorems; Section 3 contains the proofs of the supporting lemmas; and Section 4 presents additional figures from the numerical study. Under Assumptions 1.1 to 1.3, when m > C for some constant C > 0, we have the following results under two corresponding conditions on s_l(m): First, we present the following lemma, which shows that the loss function has a property similar to strong convexity. For the first case discussed in Lemma 2.1, define eg(θ(k)) = (g(θ(k)))^2, and for the second case define eg(θ(k)) = g(θ(k)). Therefore, combining Lemma 2.1, Lemma 2.2 and (7) leads to the following conclusion. Apply (15) in Lemma 2.3 with = 12; then for any 0 < α < 1, with probability at least 1 − 2m^(−α), we have A11 1 Under this case, we can still apply (15) in Lemma 2.3.
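For reference, the excerpt above compares the loss function's property with standard strong convexity. A differentiable function f is μ-strongly convex if the inequality below holds for all x and y. The symbols f, μ, x and y are generic textbook notation and are assumptions here, not the paper's own; the exact form of the weaker property proved in the lemma is not reproduced.

```latex
f(y) \;\ge\; f(x) + \nabla f(x)^{\top}(y - x) + \frac{\mu}{2}\,\lVert y - x \rVert_2^2,
\qquad \mu > 0 .
```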





max k ∈ [K] hik(x) > 1 B/K1 min i ∈ [n] min l ∈ [K] aill(x)/K. Step 2. We assume that hjr(x) attains the largest value of hik(x) for any i ∈ [n], k ∈ [K]. Then hjr(x) > 1 min

Neural Information Processing Systems

A.1 Implementation Details

Network Architecture: Inspired by [33], we utilize a pre-trained ResNet-50 [20] as the feature extractor for object recognition tasks (i.e., Office-31 [22], Office-Caltech [18] and Office-Home [46]). The overall framework is trained in an end-to-end manner via back-propagation. Stochastic gradient descent with a momentum of 0.9 is employed as the network optimizer. The initial learning rates for the feature extractor and the bottleneck layer are set to 10^-3 and 10^-2, respectively, while the parameters of the classifier are frozen. The learning rate is decayed exponentially as training progresses.
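The optimizer setup described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the excerpt does not name a framework or a decay rate, so the decay factor `gamma`, the toy quadratic loss, the `ParamGroup` class and all function names are hypothetical. The momentum update follows the common "velocity = momentum * velocity + gradient" convention, which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParamGroup:
    """One group of parameters sharing a base learning rate (hypothetical helper)."""
    name: str
    params: List[float]
    base_lr: float
    frozen: bool = False
    velocity: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # One momentum buffer entry per parameter, starting at zero.
        self.velocity = [0.0] * len(self.params)


def decayed_lr(base_lr: float, step: int, gamma: float) -> float:
    """Exponential decay: lr_t = base_lr * gamma**t (gamma is an assumed value)."""
    return base_lr * gamma ** step


def sgd_momentum_step(group: ParamGroup, grads: List[float], step: int,
                      momentum: float = 0.9, gamma: float = 0.99) -> None:
    """Apply one SGD-with-momentum update to a group; frozen groups are skipped."""
    if group.frozen:
        return
    lr = decayed_lr(group.base_lr, step, gamma)
    for i, g in enumerate(grads):
        group.velocity[i] = momentum * group.velocity[i] + g
        group.params[i] -= lr * group.velocity[i]


def build_groups() -> List[ParamGroup]:
    """Learning rates from the text: 1e-3 (feature extractor), 1e-2 (bottleneck)."""
    return [
        ParamGroup("feature_extractor", [1.0], base_lr=1e-3),
        ParamGroup("bottleneck", [1.0], base_lr=1e-2),
        ParamGroup("classifier", [1.0], base_lr=0.0, frozen=True),
    ]


def train(groups: List[ParamGroup], steps: int = 100) -> None:
    """Toy loop: the loss is 0.5 * p**2 per parameter, so the gradient equals p."""
    for step in range(steps):
        for group in groups:
            grads = list(group.params)
            sgd_momentum_step(group, grads, step)


groups = build_groups()
train(groups, steps=100)
for group in groups:
    print(group.name, group.params)
```

In this toy run the classifier parameter stays at its initial value because its group is frozen, while the bottleneck parameter moves faster than the feature-extractor parameter because its base learning rate is ten times larger.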