Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW
