On architectural choices in deep learning: From network structure to gradient convergence and parameter estimation