FINERS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Jun-17-2026, 18:22:29 GMT–Neural Information Processing Systems

Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images--particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FINERS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning and segmenting extremely small objects within high-resolution scenes. FINERS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textural response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration.

large language model, machine learning, segmentation, (19 more...)

Neural Information Processing Systems

Jun-17-2026, 18:22:29 GMT

Conferences PDF

Add feedback

Country:
- Asia > China (0.46)
- Europe (0.28)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education (0.46)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (0.93)
    - Machine Learning > Neural Networks
      - Deep Learning (0.67)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found