RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

Oct-10-2024, 22:24:00 GMT–Neural Information Processing Systems

Existing object detection frameworks are usually built on a single format of object/part representation, such as anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine them in a single framework so as to exploit the strengths of each, because the different representations extract features heterogeneously or off the regular grid. This paper presents an attention-based decoder module, similar to that in the Transformer (Vaswani et al., 2017), to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of key instances to strengthen the main query representation features in the vanilla detectors. Novel techniques are proposed for efficient computation of the decoder module, including a key sampling approach and a shared location embedding approach.
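The query/key idea in the abstract can be illustrated with a small sketch. This is not the paper's module: it assumes plain scaled dot-product attention with no learned projections and no location embedding, and it assumes that key sampling means keeping the top-k keys by a per-key score. The names `sample_keys` and `enhance_queries` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_keys(key_feats, key_scores, k):
    """Keep the k highest-scoring keys (an assumed sampling rule)."""
    order = sorted(range(len(key_feats)),
                   key=lambda i: key_scores[i], reverse=True)
    return [key_feats[i] for i in order[:k]]


def enhance_queries(query_feats, key_feats):
    """Add an attention-weighted sum of key features to each query."""
    d = len(query_feats[0])
    enhanced = []
    for q in query_feats:
        # Similarity between this query and every retained key.
        logits = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in key_feats]
        weights = softmax(logits)
        # Weighted sum of key features, one value per feature dimension.
        attended = [sum(w * key[j] for w, key in zip(weights, key_feats))
                    for j in range(d)]
        # Residual connection: the query feature is strengthened, not replaced.
        enhanced.append([qi + ai for qi, ai in zip(q, attended)])
    return enhanced


# Toy data: two query features (e.g. from boxes), three key features
# (e.g. from points), and one confidence score per key.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
scores = [0.9, 0.8, 0.1]

top_keys = sample_keys(keys, scores, k=2)
out = enhance_queries(queries, top_keys)
```

Sampling the keys before attention reduces the cost per query from the full number of key instances to k, which is the efficiency motivation the abstract gives for key sampling.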

bridging visual representation, representation, transformer decoder

Technology:
- Information Technology
  - Data Science (0.76)
  - Artificial Intelligence > Vision (0.67)