HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment
