TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification: Appendix
Some algorithms perform unstably across different runs, thus the average among several runs is a more stable measure. Using a unified measure is convenient, concise, and space-saving for ablation study and parameter analysis.

Here H = h and W = w, but to be clear, let's denote them differently. Then in Eq. (7), GMP is applied along the last dimension of hw elements, resulting in a vector of size HW.

Third, the proposed method has already considered the efficiency, with its simplified decoder and balanced parameter selection, and thus it is the most efficient one among cross-matching Transformers, as shown in Table 2 of the main paper.
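The GMP step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the shapes and variable names are assumptions, with H·W query locations and h·w gallery locations, each carrying a d-dimensional feature.

```python
import numpy as np

# Illustrative shapes: H*W query locations, h*w gallery locations,
# each with a d-dimensional feature vector.
H, W, h, w, d = 24, 8, 24, 8, 512
rng = np.random.default_rng(0)
query = rng.standard_normal((H * W, d))
gallery = rng.standard_normal((h * w, d))

# Pairwise similarity between every query location and every gallery location.
scores = query @ gallery.T            # shape (H*W, h*w)

# GMP along the last dimension (the h*w gallery elements), as in Eq. (7):
# each query location keeps its best-matching gallery location's score.
pooled = scores.max(axis=-1)          # shape (H*W,)

assert pooled.shape == (H * W,)
```

Pooling over the last axis is what reduces the (H·W, h·w) score matrix to a vector of size H·W, matching the text above.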
TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification
Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. The latter improves the performance, but it is still limited.
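The image-to-image attention the abstract refers to can be illustrated with a minimal single-head cross-attention sketch in plain NumPy. This is an assumption-laden illustration, not the paper's architecture: there are no learned projections, and the token counts and dimensions are arbitrary; query-image tokens simply attend over gallery-image tokens.

```python
import numpy as np

def cross_attention(q_feat, g_feat):
    """Single-head cross-attention (illustrative sketch, no learned
    projections): query-image tokens attend over gallery-image tokens."""
    d = q_feat.shape[-1]
    logits = q_feat @ g_feat.T / np.sqrt(d)        # (Nq, Ng) attention logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over gallery tokens
    return weights @ g_feat                         # (Nq, d) attended features

rng = np.random.default_rng(0)
q = rng.standard_normal((192, 64))   # 192 query tokens, dim 64
g = rng.standard_normal((192, 64))   # 192 gallery tokens, dim 64
out = cross_attention(q, g)
assert out.shape == (192, 64)
```

In contrast, a ViT or encoder-only Transformer applies self-attention within a single image, so no such cross-image interaction occurs, which is the inadequacy the abstract points out.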