4-bit Shampoo for Memory-Efficient Network Training
Neural Information Processing Systems
Second-order optimizers, which maintain a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. However, the states forming the preconditioner and its inverse root restrict the maximum size of models that second-order optimizers can train. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage.
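As a rough illustration of compressing 32-bit optimizer states to a lower bitwidth, the sketch below applies block-wise absmax quantization to a preconditioner-sized matrix and reconstructs it. The function names, block size, and signed 16-level code are assumptions for illustration only, not the paper's actual quantization scheme; a real implementation would also pack two 4-bit codes per byte rather than store them in int8.

```python
import torch

def quantize_4bit(x: torch.Tensor, block_size: int = 64):
    """Block-wise absmax quantization of a 32-bit state to 4-bit codes.

    Hypothetical sketch: each block of `block_size` values is scaled by its
    absolute maximum and rounded to one of 15 signed levels (-7..7), which
    fits in 4 bits. Codes are held in int8 here for simplicity.
    """
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    codes = torch.round(blocks / scales * 7).to(torch.int8)
    return codes, scales, x.shape, pad

def dequantize_4bit(codes, scales, shape, pad):
    """Reconstruct an approximate 32-bit state from 4-bit codes."""
    blocks = codes.to(torch.float32) / 7 * scales
    flat = blocks.flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

# Example: compress a Shampoo-style preconditioner state and check the error.
state = torch.randn(256, 256)
codes, scales, shape, pad = quantize_4bit(state)
recovered = dequantize_4bit(codes, scales, shape, pad)
print("max abs error:", (state - recovered).abs().max().item())
```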