Towards Understanding the Generalizability of Delayed Stochastic Gradient Descent