SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
–Neural Information Processing Systems
Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance drops significantly when dealing with complex textual expressions. The reason is that this paradigm fits the multi-modal feature fusion using only limited downstream data, so it is effective only when the textual expressions are relatively simple. Moreover, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multi-modal content from a visual-linguistic context, has not been fully investigated.
Dec-27-2025, 10:11:21 GMT
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Vision (0.64)