correlated
Supplementary File for " Stochastic Gradient Descent in Correlated Settings: AStudy on Gaussian Processes "
The supplementary file is organized as follows: Section 1 restates the assumptions and main theorems on the convergence of parameter iterates and the full gradient; Section 2 is devoted to the proofs of the two main theorems, while Section 3 includes the proofs of supporting lemmas; Section 4 includes additional figures from the numerical study. Under Assumptions 1.1 to 1.3, when m > C for some constant C > 0, we have the following results under two corresponding conditions on sl(m): First we present the following lemma, showing that the loss function has a property similar from strong convexity. For the first case discussed in Lemma 2.1, define eg(θ(k)) = (g(θ(k)))2, and for the second case define eg(θ(k)) = g(θ(k)). Therefore, combining Lemma 2.1, Lemma 2.2 and (7) leads to the following conclusion. Apply(15)inLemma 2.3 with = 12, then for any 0<α<1, with probability at least 1 2m α, we have A11 1 Under this case, we can still apply (15) in Lemma 2.3.
Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ up to a statistical error term depending on the minibatch size. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.
Supplementary File for " Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes "
The supplementary file is organized as follows: Section 1 restates the assumptions and main theorems on the convergence of parameter iterates and the full gradient; Section 2 is devoted to the proofs of the two main theorems, while Section 3 includes the proofs of supporting lemmas; Section 4 includes additional figures from the numerical study. Under Assumptions 1.1 to 1.3, when m > C for some constant C > 0, we have the following results under two corresponding conditions on s First we present the following lemma, showing that the loss function has a property similar from strong convexity. For the first case discussed in Lemma 2.1, define null g (θ ( k 1) (k 1) ( k 1) (k 1) ( k 1) (k 1) Therefore, combining Lemma 2.1, Lemma 2.2 and (7) leads to the following conclusion. Proof of Theorem 2. We start from bounding null Under this case, we can still apply (15) in Lemma 2.3. The following proof of this claim is very similar to the proof of Lemma 5.2 in [2].
Review for NeurIPS paper: Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes
Additional Feedback: I'd like to see main paper Figure 1 / supplementary figure 4.1 expanded. The two questions I have that I don't think the figure currently answers are (1) how does the variance in final \sigma {2}_{f} across trials compare to a full batch GP, and (2) if full batch GPs have smaller variance, do much larger batch sizes (e.g., say m 1000) decrease this variance further? In figure 4.1, it does not seem the variance decreases much from m 16 to m 64 -- it'd be nice to know whether the batch size is the source of the variance. If it is, then running with very large batch sizes even up to m 10000 may not be too challenging. To the point of running large batch sizes, while the ability to use SGD will clearly outperform full batch training at some size N (at a guess, probably somewhere in the the N 100k-500k range), I don't think the results in Table 1 are necessarily representative of the settings you might actually want to run sgGP or EGP with.
Review for NeurIPS paper: Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes
All the reviewers agree that the paper presents a worthwhile theoretical contribution, which may facilitate/motivate further work to tackle more challenging problems. The main limitation of the work is its practical impact as the proposed analysis does not apply to the lengthscales. Although R3 stands by their comments, they expressed their willingness to accept and recognized, during discussions, this work as an excellent attempt at the problem. Overall, I believe the NeurIPS community will benefit from this work and recommend the authors to take the reviewers' suggestions and comments into consideration.
Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full loss function, and recovers model hyperparameters with rate O(\frac{1}{K}) up to a statistical error term depending on the minibatch size. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.
Are Classification Robustness and Explanation Robustness Really Strongly Correlated? An Analysis Through Input Loss Landscape
Chen, Tiejin, Huang, Wenwang, Pang, Linsey, Luo, Dongsheng, Wei, Hua
This paper delves into the critical area of deep learning robustness, challenging the conventional belief that classification robustness and explanation robustness in image classification systems are inherently correlated. Through a novel evaluation approach leveraging clustering for efficient assessment of explanation robustness, we demonstrate that enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss - contrary to flattened loss landscapes indicating better classification robustness. To deeply investigate this contradiction, a groundbreaking training method designed to adjust the loss landscape with respect to explanation loss is proposed. Through the new training method, we uncover that although such adjustments can impact the robustness of explanations, they do not have an influence on the robustness of classification. These findings not only challenge the prevailing assumption of a strong correlation between the two forms of robustness but also pave new pathways for understanding relationship between loss landscape and explanation loss.