Review for NeurIPS paper: Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes


Additional Feedback: I'd like to see main paper Figure 1 / supplementary Figure 4.1 expanded. Two questions that I don't think the figure currently answers are: (1) how does the variance in the final \sigma^2_f across trials compare to that of a full-batch GP, and (2) if full-batch GPs have smaller variance, do much larger batch sizes (e.g., m = 1000) decrease this variance further? In Figure 4.1, the variance does not seem to decrease much from m = 16 to m = 64 -- it would be nice to know whether the batch size is the source of the variance. If it is, then running with very large batch sizes, even up to m = 10000, may not be too challenging. On the point of running large batch sizes: while the ability to use SGD will clearly outperform full-batch training beyond some size N (at a guess, probably somewhere in the N = 100k-500k range), I don't think the results in Table 1 are necessarily representative of the settings in which you might actually want to run sgGP or EGP.
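To make the suggested experiment concrete, here is a minimal sketch (on a hypothetical toy objective, not the paper's sgGP/EGP code) of what I have in mind: run repeated SGD trials at several batch sizes m and compare the across-trial variance of the final parameter estimate, which is the quantity I'd like the expanded figure to report.

```python
import numpy as np

# Toy stand-in for GP hyperparameter fitting: estimate the mean of N
# samples by minibatch SGD on the squared loss.  Hypothetical setup,
# purely to illustrate the variance-vs-batch-size comparison.
rng = np.random.default_rng(0)
N = 10_000
data = rng.normal(loc=2.0, scale=1.0, size=N)

def sgd_final_estimate(batch_size, steps=500, lr=0.1, seed=0):
    """Run minibatch SGD and return the final parameter estimate."""
    r = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(steps):
        batch = r.choice(data, size=batch_size, replace=False)
        # Gradient of mean squared loss (theta - x)^2 over the minibatch.
        grad = 2.0 * (theta - batch.mean())
        theta -= lr * grad
    return theta

# Across-trial variance of the final estimate, per batch size.
for m in (16, 64, 1000):
    finals = [sgd_final_estimate(m, seed=s) for s in range(30)]
    print(f"m = {m:4d}  variance of final estimate = {np.var(finals):.2e}")
```

If the across-trial variance shrinks steadily as m grows (as it does in this toy, where gradient noise scales like 1/m), that would pin the variance in Figure 4.1 on the batch size; if it plateaus, the source lies elsewhere.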