Referencing Where to Focus: Improving Visual Grounding with Referential Query
Yabing Wang
Neural Information Processing Systems
Visual grounding aims to localize the object referred to by a natural language expression in an image. DETR-based visual grounding methods have recently attracted considerable attention because they directly predict the coordinates of the target object without relying on additional components such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoders, which typically generate learnable queries by random initialization or from linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as the queries carry no target-related information at the beginning of decoding. Furthermore, these methods use only the deepest image feature during query learning, overlooking the importance of features from other levels.
Mar-20-2025, 15:26:53 GMT
- Country:
- Asia > China
- Heilongjiang Province (0.14)
- Shaanxi Province (0.14)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.68)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Neural Networks (0.93)
- Statistical Learning (0.67)
- Natural Language (1.00)
- Vision (1.00)