HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment