Why (and When) does Local SGD Generalize Better than SGD?
