Revisiting Reliability in Large-Scale Machine Learning Research Clusters