MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
Sun, Kai, Bai, Yushi, Yang, Zhen, Zhang, Jiajie, Qi, Ji, Hou, Lei, Li, Juanzi
arXiv.org Artificial Intelligence
Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet training these encoders with simple random in-batch negatives limits their ability to capture fine-grained visual differences, particularly in geometric scenarios. To address this challenge, we propose a novel hard negative contrastive learning framework for the vision encoder. It combines image-based contrastive learning, using generation-based hard negatives created by perturbing diagram generation code, with text-based contrastive learning, using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected by caption similarity. We train a vision encoder (CLIP) with our hard negative training method, yielding MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further conduct ablation studies to analyze three key factors: hard negative types, the efficiency of image-based negatives, and training configurations. These analyses yield important insights into optimizing the training pipeline of the vision encoder for fine-grained geometric reasoning tasks.

Geometric mathematical reasoning has garnered significant attention as an essential capability for large multimodal models (Anthropic, 2024; OpenAI, 2023; Bai et al., 2023). It requires fine-grained identification of visual elements (Lu et al., 2023) within the given images, such as geometric shapes, spatial configurations, and the relationships between them (He et al., 2024).
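To make the idea of contrastive learning with hard negatives from the abstract concrete, the following is a minimal sketch of an InfoNCE-style loss in which the negatives are supplied explicitly rather than drawn at random from the batch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy 2-D embeddings, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    hard_negatives: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE loss for one anchor against one positive and explicit negatives.

    Hypothetical helper: the paper's exact objective may differ.
    """
    # Logit 0 is the matching pair; the remaining logits are the negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in hard_negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability assigned to the positive candidate.
    return log_denominator - logits[0]


# Toy embeddings (made up for illustration only).
image = [1.0, 0.0]
caption = [0.9, 0.1]
easy_negative = [[-1.0, 0.0]]  # clearly unrelated caption
hard_negative = [[0.8, 0.3]]   # caption that differs only slightly

loss_easy = hard_negative_info_nce(image, caption, easy_negative)
loss_hard = hard_negative_info_nce(image, caption, hard_negative)
print(loss_hard > loss_easy)
```

In this toy setting the near-duplicate negative yields a larger loss than the unrelated one, which is the intuition behind preferring hard negatives: an easy negative contributes almost no training signal, while a hard one forces the encoder to separate fine-grained differences.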
However, the "eyes" of most existing LMMs, i.e., their pretrained vision encoders such as CLIP (Patel et al., 2024; Yang et al., 2023; Goel et al., 2022), are primarily trained on general visual datasets that do not emphasize the intricate features necessary for specialized mathematical reasoning. Therefore, these models often fail to understand the nuanced geometric information accurately and produce incorrect reasoning and answers. As shown in Figure 1, facing a simple parallel line problem, the leading LMMs such as GPT-4o (OpenAI, 2024a), Claude-3 (Anthropic, 2024), and Qwen2.5-VL
Oct-2-2025