Why Do We Need Weight Decay in Modern Deep Learning?