M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation