TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification: Appendix

Neural Information Processing Systems

Some algorithms perform unstably across different runs, so the average over several runs is a more stable measure. Using a unified measure is convenient, concise, and space-saving for the ablation study and parameter analysis. Here H = h and W = w, but for clarity, let us denote them differently. Then, in Eq. (7), GMP is applied along the last dimension of hw elements, resulting in a vector of size HW. Third, the proposed method already takes efficiency into account, with its simplified decoder and balanced parameter selection; accordingly, it is the most efficient of the cross-matching Transformers, as shown in Table 2 of the main paper.
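As a concrete illustration, here is a minimal sketch of the GMP step described above, assuming a hypothetical similarity map `scores` between one query image with H×W locations and one gallery image with h×w locations; the names and shapes are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of the GMP step described above (Eq. (7)).
# `scores` is a hypothetical pairwise similarity map; shapes are
# illustrative assumptions only.
import torch

H, W, h, w = 24, 8, 24, 8           # feature map sizes (H = h, W = w here)
scores = torch.randn(H * W, h * w)  # similarity of each query location to each gallery location

# Global max pooling (GMP) along the last dimension of h*w elements:
# for each of the H*W query locations, keep its best-matching score.
pooled, _ = scores.max(dim=-1)      # shape: (H*W,)
print(pooled.shape)                 # torch.Size([192])
```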


TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification

Neural Information Processing Systems

The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching.


TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification

Neural Information Processing Systems

Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. The latter improves the performance, but it is still limited.
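To illustrate the "image-to-image attention" the abstract refers to, below is a hedged sketch contrasting ViT-style self-attention, which aggregates features within a single image, with cross attention between a query image and a gallery image, which produces the pairwise similarities that matching needs. All tensor names and sizes are illustrative assumptions rather than the paper's actual implementation.

```python
# Sketch: self-attention aggregates within one image; cross attention
# relates two images. Names and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

d = 64                               # feature dimension
query_feats = torch.randn(192, d)    # H*W locations of a query image
gallery_feats = torch.randn(192, d)  # h*w locations of a gallery image

# Self-attention (ViT-style): each location attends to other locations
# of the SAME image -- feature aggregation, not image matching.
self_attn = F.softmax(query_feats @ query_feats.t() / d ** 0.5, dim=-1)

# Cross (image-to-image) attention: query locations attend to gallery
# locations, yielding pairwise scores usable for matching.
cross_scores = query_feats @ gallery_feats.t() / d ** 0.5  # (192, 192)
```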