Gradient Descent
A Proof of Theorem 1, w t, and w
Let ŵ be this arg min, which is unique since the objective is strongly convex. Substituting the definition of p and rearranging completes the proof. Lemma 2. Let l(; z) be H-smooth, convex, and non-negative for each z, let the stochastic gradient For the first term on the right hand side, we note that due to the algorithm's projections, all of the Lemma 3. Let l(; z) be H-smooth and non-negative for all z and let L This follows almost immediately from [Theorem 2.1.5 This proof is based on similar ideas as the proof of Lemma 5 and Theorem 2 due to Lan [17]. The key difference is that Lan considers a setting in which the variance of the stochastic gradients are uniformly bounded, while in our setting, we do not directly assume any bound on this quantity.
A Detailed comparisons with related work
In Table 1, we compare our agnostic learning results. Our results in this setting come from Theorem 3.3. We note that the sample complexity for Diakonikolas et al. To prove Lemma 3.5, we use the following result of Y ehudai and Shamir [35]. We first consider the case when σ satisfies Assumption 3.1.