SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages