OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling

Mar-22-2026, 22:51:18 GMT–Neural Information Processing Systems

Constrained by the separate encoding of vision and language, existing grounding and referring segmentation methods rely heavily on bulky Transformer-based fusion encoders and decoders and on a variety of early-stage interaction techniques. At the same time, current mask visual language modeling (MVLM) fails to capture the nuanced referential relationship between image and text in referring tasks.

artificial intelligence, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.79)
  - Machine Learning (0.59)