Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training