Taming the Memory Beast: Strategies for Reliable ML Training on Kubernetes